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Introduction 


In 1990, it was estimated that Cray Research's installed base of 
approximately 200 machines spent 40% of all CPU cycles computing the 
fast Fourier transform (FFT) [link]. With each machine worth about 
USD$25 million, the performance of the FFT was of prime importance. 


Today, use of the FFT is even more pervasive, and it is counted among the 
10 algorithms that have had the greatest influence on the development and 
practice of science and engineering in the 20" century [link]. Huge numbers 
of mobile smartphones, tablets and PCs [link], [link], most of which are 
equipped with single-instruction, multiple-data (SIMD) [link], [link] 
microprocessors, compute the FFT on a large scale for a plethora of sound, 
video and image processing applications. In the space of a few years, 
mobile applications have become a part of many people's everyday 

lives [link]. 


This thesis shows that the key to optimizing the performance of the split- 
radix FFT algorithms on SIMD microprocessors is latency and spatial 
locality optimizations, and in some cases, a variant of the split-radix FFT 
called the conjugate-pair algorithm [link], [link], [link], [link]. It is also 
shown that extensive machine specific calibration may be superfluous. 


Hypotheses 


FFTW [link], [link], [link], SPIRAL [link], [link], [link] and UHFFT [link], 
[link], [link], [link], [link] are state of the art FFT libraries that employ 
automatic empirical optimization. SPIRAL automatically performs 
machine-specific optimizations at compile time, and FFTW and UHFFT 
automatically adapt to a machine at run-time. Aside from the use of 
automatic optimization, a common denominator among these libraries is the 
use of large straight line blocks of code and optimized memory locality. 


The hypotheses outlined below test whether good heuristics and model- 
based optimization can be used in the place of automatic empirical 
optimization. 


Hypothesis 1: Accessing memory in sequential “streams” is critical for 
best performance 


Large FFT exhibit poor temporal locality, and when computing these 
transforms on microprocessor based systems that feature a cache, best 
performance is typically achieved when “streaming” sequential data 
through the CPU. Hypothesis 1 is tested in Implementation Details with 
replicated coefficient lookup tables that trade-off increased memory size for 
better spatial locality, and in Streaming FFT by topologically sorting a 
directed acyclic graph (DAG) of sub-transforms to again improve spatial 
locality. 


Hypothesis 2: The conjugate-pair algorithm is faster than the ordinary 
split-radix algorithm 


Hypothesis 2 is based on the idea that memory bandwidth is a bottleneck, 
and on the fact that the conjugate-pair algorithm requires only half the 
number of twiddle factor loads. This hypothesis is tested in Split-radix vs. 
conjugate-pair, where a highly optimized implementation of the conjugate- 
pair algorithm is benchmarked against an equally highly optimized 
implementation of the ordinary split-radix algorithm. 


Hypothesis 3: The performance of an FFT can be predicted based on 
characteristics of the underlying machine and the compiler 


Exploratory experiments suggest that good results can be obtained without 
empirical techniques, and that certain parameters can be predicted based on 
the characteristics of the underlying machine and the compiler used. 
Hypothesis 3 is tested in Results and Discussion by building a model that 
predicts performance, and by benchmarking FFTW against an 
implementation that does not require extensive calibration, on 18 different 
machines. 


Scope 


In investigating the hypotheses, the scope of this work has been limited in 
several ways: 


1% 


It is limited to single-threaded complex 1D FFTs, because multi- 
dimensional, multi-threaded or multi-processor FFTs (or any 
combination thereof) are ultimately decomposed into 1D components 
running on a single core, and all other things being equal, it is the 
performance of these 1D components running on a single 
microprocessor core that determines the overall performance of a 
given multi-threaded implementation; 


. It is limited to transforms that operate on vectors of length 2 where 


€ ° because these are the easiest to compute on machines, and 
consequently the most often used by applications. This excludes the 
prime-factor algorithm [link], [link], and the Radar [link] and 
Bluestein [link], [link], [link] algorithms for prime sizes; 


. It is limited to the split-radix [link], [link], [link], [link], [link] and 


conjugate-pair [link], [link], [link], [link] algorithms. The Winograd 
algorithm [link], [link], [link], [link] is excluded because of its low 
performance on systems where multiplication costs about the same as 
addition; 


. It is limited to out-of-place transforms, because they are generally 


faster than in-place transforms, except at the boundaries of the 
cache [link]; 


. The benchmark experiments are limited to the Intel x86 and ARM 


machines, because it is estimated that 92% of the microprocessors in 
the rapidly expanding mobile market are ARM devices [link], while 
Intel's share of the worldwide PC and mobile PC microprocessors 
markets is estimated to be 79.3% and 84.4%, respectively [link]. 


Contributions 


The contributions of this work are summarized as follows: 


ie 


Z. 


Three methods of computing the conjugate-pair algorithm on SIMD 
microprocessors are described in Streaming FFT; 

The source code for the high-performance SIMD FFT library 
developed in this thesis is publicly available under a permissive open 


source licence on github. 


Organization 


This work is divided into two parts. The first part, Chapters 1-4, 
encompasses the relevant background, while the second part, Chapters 5-8, 
is concerned with contributions that challenge the state of the art. 


A brief overview of the contents of each chapter: 


ee 


Di 


Algorithms provides an overview of FFT algorithms from the 
mathematical perspective; 

Implementation details complements the mathematical perspective of 
the previous chapter with a more focused view of the low level details 
that are relevant to efficient implementation on SIMD 
microprocessors; 

Existing libraries reviews existing state of the art libraries, with 
reference to algorithms and implementation details of the previous 
chapters; 


. Streaming FFT describes SFFT, a library for SIMD microprocessors 


that is, in many cases, faster than the state of the art FFT libraries 
reviewed in Existing libraries; 


. Benchmark methods describes the benchmarking methods used to 


evaluate performance and accuracy of various FFT implementations 
throughout this work; 


. Results and discussion presents the results of benchmarks on 18 


different machines, as well as the results of model-based optimization 
experiments, with reference to earlier chapters and other related work; 


. Conclusions and future work concludes the work with a review of the 


hypotheses, a summary of the contributions, and some idea for 
directions that future work might take. 


Algorithms 


Efficient computation of the fast Fourier transform (FFT) requires an understanding of the computation at every 
level of abstraction, from the high-level algorithmic view down to the low-level details of the target machine (or 
failing that, a lot of time to code all known FFT algorithms and exhaustively search the configuration space). This 
chapter provides an overview of FFT from the mathematical perspective. 


Fast Fourier transform algorithms are derived from the discrete Fourier transform (DFT), which is formally 
defined as [link]: 


Equation: 
N-1 
k 
X,= > wi Ln, 
n=0 
where k = 0,---, N — 1 and wy is the primitive root-of-unity, defined as e~2"V~!/% (often referred to as a 


“twiddle factor” in the context of fast Fourier transforms). X; and x, are sequences of complex numbers, X;, 
being the outputs in the frequency domain, and z,, being the inputs in the time or space domain. 


A source of mild confusion in the FFT literature is the sign of the twiddle factor [link]; the definition in [link] is 
considered to be the engineers view of the discrete Fourier transform, where the goal is to compute the coefficients 
of a discrete Fourier series. Mathematicians, on the other hand, typically view the DFT as a method of evaluating a 
polynomial at the powers of a primitive root of unity, and thus consider [link] to be an inverse DFT [link]. Cooley 
and Tukey [link], Fiduccia [link] and Bernstein [link] are notable examples of those who adopt the mathematicians 
convention. This work adopts the engineer's view of a DFT, and thus the inverse discrete Fourier transform (IDFT) 
is defined by the following equation: 


Equation: 
N-1 
1 = 
tn = 7 wW aX 
k=0 
where n = 0,---, N — 1. It should be noted that in some implementations, such as FFTW and the implementation 


presented in this thesis, the IDFT is actually non-normalised for reasons of efficiency; i.e., 
IF FT(FFT(z«)) = Nz, thus avoiding division of each of the samples in time by N [link]. 


Cooley-Tukey 


In 1965 James Cooley and John Tukey published a description of an economical algorithm for computing the DFT 
that became known as the Cooley-Tukey FFT, or simply the FFT due to its overwhelming popularity [link]. A later 
investigation by Heideman, Johnson and Burrus [link] revealed that the algorithm had actually been discovered 
several times in various forms prior to Cooley and Tukey, most notably by Gauss sometime around 1805 [link]. 


The algorithm recursively divides a transform of size N = N,N, into smaller DFTs of size N; and N2 (where N 
is highly composite), reducing the time complexity from O(n?) to O(n log n) by exploiting common factors. 


As the algorithm recursively divides a DFT, either Ny or N2 is typically a small factor, and is known as the radix. 
Small N, characterizes the algorithm as being decimation-in-time (DIT), otherwise the algorithm is decimation-in- 
frequency (DIF). If the radix changes between stages, then the algorithm is referred to as “mixed-radix'. 


For example, a radix-2 decimation-in-time algorithm decomposes [link] into a sum over the even indices (n = 2n2 
) and a sum over the odd indices (n = 2n2 + 1): 
Equation: 


N/2-1 N/2-1 
(2n2)k (2no+1)k 
X,= Wy Lan. + ) Wi Dn+1 
n2=0 


n2 =0 


The trigonometric coefficient in the second sum can be expanded to ww, and the term now common to both 
sums is simplified using the identity eu = Om" Because one of the trigonometric coefficients in the second 


sum is constant with respect to the index variable, it may be factored out to obtain: 
Equation: 


N/2-1 N/2-1 
nok k nok 
X= y Wy/9 Lon, + Wy ; Wi/2 Lon.41 


n2=0 n2=0 


where the two sums are now DFTs of the even indexed terms (#2, ) and the odd indexed terms (%2n,+1), which are 
combined with twiddle factor w,. 


In order to compute the transform more efficiently, the Cooley-Tukey algorithm divides X;, into two halves, and 
exploits the periodicity of sub-transforms and symmetries in the trigonometric coefficients. Firstly, [link] is 
rewritten as two halves with &; substituted for the even sub-transform, and O; substituted for the odd sub- 
transform: 

Equation: 


Xi = Eyt+ wk On 
k+N/2 
Xepw/2 = Epinjat Wy } Oxsn/2 


where k = 0,---,.N/2— 1. Because of the periodicity property of the outputs of a DFT, E;, = Exn/2 and 
Ox = Ox+n/2; [link] simplifies thus: 


Equation: 
Xk = Ex+wkO; 
k+N/2 
Xepw/2 = Et oe P Op 
And finally, by exploiting symmetries in the complex exponential function, namely that gee a —wh, the 
radix-2 DIT FFT can be expressed as: 
Equation: 
Xi, = Eyt+ wk Ox 
XE4N/2 = Ex- why Ox 


which makes it clear that each pair of outputs share common computation, approximately halving the number of 
arithmetic operations when compared to the DFT. But since the even and odd terms in [link] are themselves DFTs 
that can be computed with the FFT, the savings compound with each stage of recursion. The total number of real 
arithmetic operations required to compute the radix-2 FFT can be expressed with the following recurrence relation: 
Equation: 


T(n) = 2T(n/2)+5n—-6 forn>2 
aad Ki forn=1 


which is in O(n log n). 


Split-radix 


In 1968 a derivitive of the Cooley-Tukey algorithm broke the record for the lowest number of arithmetic 
operations for computing the DFT [link], [link], [link]. The algorithm was initially discovered by Yavne [link], but 
was not widely cited until 1984 when it was rediscovered by Duhamel and Hollman [link] and became known as 
the split-radix algorithm. 


The split-radix algorithm improves the arithmetic complexity of the Cooley-Tukey algorithm by further 
decomposing the odd parts into odd-odd and odd-even parts, while the even parts are left alone because they have 
no multiplicative factor. More formally, [link] can be re-written as three sums: 


Equation: 
ae 1 N/4-1 N/4-1 
2no2k ene ern 
ap Wy L2n, + ye Langti + Sw LAng+3 
n4a= =0 


where n = 4n4 = 2n2. As with the Cooley-Tukey radix-2 example in !"Cooley-Tukey", the trigonometric 
coefficients are expanded and simplified, and the terms constant with respect to the index variables factored out: 
Equation: 


N/2-1 N/4-1 N/4-1 
=) k ) nak 3k ) nak 
whe L2n2 + Why Wn /4 LAng+1 7 Why Wn /4 LAny+3 
n2=0 n4=0 n4=0 


By substituting the even sum with U;, (where k = 0,---, N/2— 1) and the odd sums with Z;, and Z;, (where 
k =0,---,.N/4 — 1), [link] is simplified: 
Equation: 


X, =U, +05, 2,4 w¥Z, 


Computation can be factored out of [link] by again exploiting periodicity in the sub-transforms and symmetries in 
the twiddle factors. [link] is first expressed as an equation of four parts: 


Equation: 
X; = Ux wh Zi, 4 T wit Zt 
k+N/2 3(k+N/2 
XEN /2 = Upsn/2 TWy : ZEN /2 abe wy ; ee 
k+N/4 3(k+N/4 
Xein/4 = Ubsinjat+on : Zern/at wit ! Dos 
3(k+3N/4 
Xei3w/4 = Upisnya +wy ee Z443N/4 + Wy oo Zi k4+3N/4 
where & = 0,---,.N/4 — 1. The periodicity properties of the sub-transforms can be expressed with the 


relationships U;, = Un+N/25 i ZetN/A and Zi, = = Fis wya: These are used to simplify [link] thus: 
Equation: 


Xz = U;, + wh Z, + wt Zh 


XkiN/2 => U;, a we + ig? 
XkiN/4 — Uninsa i iad gee Zk + on) Zi, 
Xni3n/4 = Unin/4 oi. en oe es Ar aaa Zi 


Symmetries in the complex exponential function are again used to expose common computation among each part 
of the equation; hence 


Equation: 
Xk = Unt (wh 2, + wZ7) 
Xuan = Un— (wh Zn-+wHZ)) 
Xana = Vana —i (wh Ze — wt Z}) 
Xpsan/s = Unewjati (wh Zp — wi? Zi) 


which, when recursively applied to the sub-transforms, results in the following recurrence relation for real 
arithmetic operations: 
Equation: 


T(n/2) + 2T(n/4)+6n—-—4 forn>2 
0 forn=1 


P(n)={ 


The exact solution T (n) = 4n log, n — 6n + 8 for n > 2 was the best arithmetic complexity of all known FFT 
algorithms for over 30 years, until Van Buskirk was able to break the record in 2004 [link], as described in 
“Tangent”. 


Van Buskirk's arithmetic complexity breakthrough was based on a variant of the split-radix algorithm known as the 
“conjugate-pair” algorithm [link] or the “—1 exponent” split-radix algorithm [link], [link]. In 1989 the conjugate- 
pair algorithm was published with the claim that it had broken the record set by Yavne in 1968 for the lowest 
number of arithmetic operations for computing the DFT [link]. Unfortunately the reduction in the number of 
arithmetic operations was due to an error in the author's analysis, and the algorithm was subsequently proven to 
have an arithmetic count equal to the original split-radix algorithm [link], [link], [link]. Despite initial claims about 
the arithmetic savings being discredited, the conjugate-pair algorithm has been used to reduce twiddle factor loads 
in software implementations of the FFT and fast Hartley transform (FHT) [link], and the algorithm was also 
recently used as the basis for an algorithm that does reduce the arithmetic operation count, as described in 
Tangent". 


The difference between the conjugate-pair algorithm and the split-radix algorithm is in the decomposition of odd 
elements. In the standard split-radix algorithm, the odd elements are decomposed into two parts: %4,,,41 and 
LAn4+3 (see [link]), while in the conjugate-pair algorithm, the last sub-sequence is cyclically shifted by —4, where 
negative indices wrap around (i.e., x_1 = Zy_1). The result of this cyclic shift is that twiddle factors are now 
conjugate pairs. Formally, the conjugate-pair algorithm is defined as: 


Equation: 
N/2-1 N/4-1 N/4-1 
—_ nok k nak —k nak 
Xp = ; W/9 LIn, + Wy y Wyjq TAng+1 + wy Wiyjq T4ny—1 
n2=0 n4=0 n4=0 


As with the ordinary split-radix algorithm, a DIT decomposition of the conjugate-pair algorithm can be expressed 
as a system of equations: 


Equation: 


Xx = Uyt+ (wy Ze + wy Zp) 
Xiewp2 = Un- (wy Zn + wy Zh) 
Sia. = tye 8 ORE up) 
Xpi3n/4 = Upinjati (wi Zk — wy Zp) 


where & = 0,---,.N/4 — 1. As can be seen, the trigonometric coefficients are conjugates — a feature that can be 
exploited to reduce twiddle factor loads. 


Tangent 


In 2004, some thirty years after Yavne set the record for the lowest arithmetic operation count, Van Buskirk posted 
software to Usenet that had asymptotically reduced the arithmetic operation count by about 6%. Three papers were 
subsequently published [link], [link], [link] with differing explanations on how to achieve the lowest arithmetic 
operation count initially demonstrated by Van Buskirk. 


Although all three papers describe algorithms that achieve the lowest arithmetic operation count in the same way, 
and thus can be considered to be different views of the same algorithm, all three papers refer to the algorithms by 
different names. Lundy and Van Buskirk [link] refer to their algorithm as “scaled odd tail FFT”, Bernstein [link] 
describes an algorithm named “tangent FFT”, while Johnson and Frigo [link] refer to the algorithm by various 
names. Many works have cited Johnson and Frigo for the algorithm [link]. Of these names, “tangent FFT” is used 
in this work because it is the most descriptive; scaling the twiddle factors into tangent form was the linchpin of 
Van Buskirk's breakthrough in arithmetic complexity. 


Bernstein expresses a DIF decomposition of the tangent FFT in a very concise but somewhat obscure polynomial 
form that was first practised by Fiduccia [link]. In order to be consistent with earlier sections, a DIT decomposition 
of the tangent FFT using linear functions will be described in this section.[ footnote] While the polynomial form is 
more elegant and concise, expressing the FFT in terms of linear functions has the advantage of mapping to 
software or hardware more directly. 

Although derived differently, the underlying structure presented here is identical to the network transpose of 
Bernstein's tangent FFT. In contrast to Johnson and Frigo's algorithm of four sub-transforms, the view presented 
here uses only one sub-transform and a scaled split-radix transform. 


The key to the tangent FFT is Van Buskirk's observation that if the trigonometric constant wh, =cos 6+ 7 sin 0 is 
factored as (1 + 7 tan 6) cos 0 or (cot 6+ 7) sin 6, the multiplication by cos 6 or sin @ can sometimes be 
absorbed elsewhere in the computation, assuming the constants are precomputed, and the remaining multiplication 
by constants of the form +(1 + 7 tan 8) or +(cot @ + 7) now only costs four floating point operations instead of 
six, assuming the usual scheme of complex multiplication using four multiply and two add operations. 


Firstly, consider the conjugate-pair FFT being recursively scaled by a wavelet $ yx: 


Equation: 
X, SN /2,k SN/4,k _pf SN/4,k 
gg aa Pe aa) 
SN,k SN,k SN,k SN,k 
for k = 0,---, N/4 — 1, and where Uj is evaluated with X;/sy/2,4, and Z;, and Z;, are evaluated with 
X;,/8N/4,k- 


The wavelet is crafted such that it is periodic in k (ie., sp = SN,k+N/4) and wh, (sw/4,e/SN,k) is of the form 
+(1+7 tan @) or +(cot 9+ 7). Bernstein defines the wavelet as [link]: 
Equation: 


4 Onk _ (4 2rk 
SN k= [| max cos ae , sin WN 


£>0 


Multiplying Z;, and Z;, by the scaled constants saves a total of four floating point operations, while scaling Uj, 
costs four operations, resulting in no gain or loss. But the cost of scaling the result back to X;, is about 2 real 
operations. In order to realize a reduction in the number of floating point operations, the split-radix FFT is 
decomposed further, so that the extra operations can be absorbed into constants in the sub-transform. Starting with 
the unscaled split-radix FFT (see [link]), the sum over the x2, terms is itself decomposed with a split-radix 
decomposition into @4,,,, £gn,+2 and %g,, 46, resulting in a DFT of five sums: 


Equation: 
N/4-1 N/8-1 N/8-1 
XoS 4ngk (8ng+2)k (8ng+6)k 
k= Why LAng =i Why L8ng+2 T W nz Z8ng+6 
n4g=0 ng=0 ng=0 
N/4-1 N/4-1 
S- (4n4+1)k S- (4n4+3)k 
r N TAangt1 1 N LAn4+3 
n4=0 n4=0 


where n = 4n4 = 8ng. As with earlier decompositions, invariants are factored out to obtain: 
Equation: 


N/4-1 N/8-1 N/8-1 
Anak 2k ngk 6k ngk 
XX, = ; Wy Ld4ng + Wy S Wg T8ng+2 + Wy S Wg T8ns+6 
n4=0 ng=0 ng=0 
N/4-1 N/4-1 
k nak 3k nak 
TT Wy S Wn /4 Langti + Wy S Wn /4 LAni,+3 
n4a=0 n4=0 


Following from the conjugate-pair split-radix algorithm, xg,,,+¢ is shifted cyclically by —8 and x4,,,+3 is shifted 
cyclically by —4 to obtain: 


Equation: 
N/4-1 N/8-1 N/8-1 
Angk 2k ngk —2k ngk 
X= S Wy Lang + Wy S W/g LEngt2 + Wy S Wig L8ng—2 

n4=0 ng=0 ng=0 

N/4-1 N/4-1 
k nak —k nak 
+ Wy S WN /4 LAngt1 + Wy S Wn/4 LAny—1 

n4=0 n4=0 


where £_, = £N—n. Note that the sums over @gp,+2 and £g,,—2 are multiplied by constants that are now 
conjugate pairs, as are the sums over @4n,,+1 and L4n,-1. 


The sum over 24n, is now substituted with U;, (where k = 0,---, N/4 — 1), while the sums over agp,+2 and 
L8n,—2 are respectively substituted with Y;, and Y (where k = 0,---, N/8 — 1) and the sums over #4n,41 and 
LAn4—1 Tespectively substituted with Z, and Z (where k = 0,---, N/4 — 1), simplifying [link] thus: 
Equation: 


Xp, = Un HWY + WHY, + WHT + wy Zh 


As with earlier examples, computation is factored out of [link] by exploiting periodicity in the sub-transforms and 
symmetries in the twiddle factors. [link] is first expressed as a parametric equation of eight parts: 
Equation: 


2(k+pN) 


k+pN 
XkipN = Ukipn + Wy Yi-tpn + we! 


pN) 


—(k+ ! 
Vigan + wae ’ Zev pN + Wy Zi-pN 


where k = 0,---, N/8—1andVp € {0, 2 F - ; 2 ; ; 3 ; 4 ; i }. By exploiting periodicity in the sub-transforms: 


Equation: 
U, = Upinya 
Ye = Yeins 
Yy = Yien/s 
Ze = Zeina 
Z k = Ei k+N/4 


and the following symmetries in the twiddle factors: 


Equation: 
wrt _ gery) = gre _ Sy 
= instr) _ gen -_ seg tht 5/8) 
= ee) 
ae = is _ =i, _ sig, ON) 
_ ig re) _ jg = ig 
_ Sag) 
wh = = - al = ane 
wr! _ <n = uae 7 =iy ee 
ite _ SHA = ay ne = at 
wee _ ge - ig Ce) = a en 
[link] is rewritten thus: 
Equation: 
Xe = Unt (WY t+ APY) + (whee +52) 
Xppw2 = Unt (wY, + whi) — (wh Ze + wy Zj) 
Xppw/a = Un — (wY, + wf) — i (wh Ze — wy Zj) 
Xpsvswjyn = Un — (WHY, + wy") +1 GEA: ea 


ie ~ wy" Y,) + (wi BZucnys + uy TEN Eas) 


N/8 
i (why ee ZKLN/8 — wy" i se) 


XE4N/8 _ Un+N/8 —4 
Xpiznjs = Uninje tt =i ky, 


ae) 


2k 


XE45N/8 = Un+n/3 — 4 


Yi t) - 
mene (uy a fag: 
) 


Xess = Unie + i (WY, — w3P*¥/ +i (wy ee en MOT sas) 


By applying terms with the appropriate scaling, viz. ay, = 8N/4,k/8N,ks BN,k = SN/8,k/SN/2,ks 
YN,k = 8N/4,b+N/8/SN,k+N/8> ON,k = SN/2,b/SN,k aNd EN,k = $N/2,b+N/8/8N,b+N/8> [link] now becomes: 


Equation: 
Xx/SN,k = Upon + (Bn nw Vi + Bn wy Yi) On, + (an pw Zp + an, hwy Zp) 
Xin p/snk = Unone+ (Bn nwn Yn + Bn wwy Vi) one — (annwy Ze + an pw Zh) 
Xpin/a/Sn~eo = Upan,e — (Bn nw Vi + Bn wy Yi) On —1 (an hwy Ze = an hwy Zp) 
Xpi3n/4/Sne = Unorne — (Bn pwrt¥e + Bu ew Vi) one +t (aw wh Ze — an py Zj) 
. 3 k+N/8 ~(k+N/8) , 
Xeinjs/snk = Unyn/synk — 4 (Bn, nwrtY, — Bnew Yx) €N,k + (own / Ze+N/8 + NRW een 
. = ; k+LN/8 ~(k+N/8) 
Xissn/s/snk = Unenjs ive +4 (On nen Ye — Bn rw Vy )enn — 4 (yw,wh? Zaanjs — In nkwye 
: _ k+N/8 ~(k+N/8) , 
Xissw/s/svk = Unenjsyve — 4 (On awn Ye — Bn aw Vy )en,e — (ywwn? Zune tn awy 2 
; = ‘ k+N/8 ~(k+N/8) 
Xei7w/s/Suk = Unsnw/syna tt (Bn nwrtY, — Bnew Y x) ENE +1 (ywwne / Zk+N/8 — YN Rw eu 


Assuming that the scaling factors are absorbed into precomputed twiddle factors where possible (e.g., a NRW, isa 
single precomputed constant), computing [link] requires about (68/8) N real operations, in contrast to (72/8)N 
operations for [link]. Further assuming that operations are skipped in the cases where precomputed constants are of 
the form +1 or +7, a further 28 real operations are saved in [link]. Thus the arithmetic cost of [link] can be 
expressed with the following recurrence relation: 


Equation: 


3T(n/4) + 2T(n/8)+ max {n—12,0}+7.5n—-16 forn>8 


i. _ 16 forn=4 
a 4 for n= 2 
0 forn=1 


Bernstein gives the exact solution of [link] as [link]: 
Equation: 


T(n) = (34/9)n log, n — (142/27)n 
— (2/9)(—1)'8" logy m + (7/27)(—1) "8" +7 


forn > 2. 
[link] is scaled by sz, but if the application is convolution in frequency, the scaling could be absorbed into the 


filter, and the cost of scaling the results back to X; avoided. Otherwise, a split-radix FFT can be used to change 
basis, absorbing the scaling into the twiddle factors of the 4,41 and &4,,,—1 terms: 


Equation: 
Xk = U; + (sv pwh Z, fs Sv RW Zh) 
Xpywj2 = Up— (sv pwh Ze + Sv Wy Zy) 
Xena = Unenja—i(swpwk Ze — Sv KW Zi) 


Xpign/4 = Uninjyatt (sy pw Zp — Su pW Zp) 


where Z; and Z 4 are now recursively computed with the tangent FFT of [link], and the U;, terms are themselves 
computed with [link]. The arithmetic cost of computing the tangent FFT in the traditional basis is thus expressed: 
Equation: 
T’ (n/2) + 2T (n/4) + 8n+ max {3n — 16,0} forn >4 
Tn) = A for n = 2 
0 forn=1 


giving rise to Van Buskirk's exact operation count of [Link]: 
Equation: 


T'(n) = (34/9)n logy n— (124/27)n — 2 logy n 
= (2/9)(—1)'°8" log, n+ (16/27)(—1)'°" 48 


forn > 2. 


Implementation Details 


This Chapter complements the mathematical perspective of Algorithms 
with a more focused view of the low level details that are relevant to 
efficient implementation on SIMD microprocessors. These techniques are 
widely practised by today's state of the art implementations, and form the 
basis for more advanced techniques presented in later chapters. 


Simple programs 


Fast Fourier transforms (FFTs) can be succinctly expressed as 
microprocessor algorithms that are depth first recursive. For example, the 
Cooley-Tukey_FFT divides a size N transform into two size N/2 transforms, 
which in turn are divided into size N/4 transforms. This recursion continues 
until the base case of two size 1 transforms is reached, where the two 
smaller sub-transforms are then combined into a size 2 sub-transform, and 
then two completed size 2 transforms are combined into a size 4 transform, 
and so on, until the size N transform is complete. 


Computing the FFT with such a depth first traversal has an important 
advantage in terms of memory locality: at any point during the traversal, the 
two completed sub-transforms that compose a larger sub-transform will still 
be in the closest level of the memory hierarchy in which they fit (see, i.a., 
[link] and [link]). In contrast, a breadth first traversal of a sufficiently large 
transform could force data out of cache during every pass (ibid.). 


Many implementations of the FFT require a bit-reversal permutation of 
either the input or the output data, but a depth first recursive algorithm 
implicitly performs the permutation during recursion. The bit-reversal 
permutation is an expensive computation, and despite being the subject of 
hundreds of research papers over the years, it can easily account for a large 
fraction of the FFTs runtime — more so for the conjugate-pair algorithm 
with the rotated bit-reversal permutation. Such permutations will be 
encountered in later sections, but for the mean time it should be noted that 
the algorithms in this chapter do not require bit-reversal permutations — the 
input and output are in natural order. 


IF N= 1 
RETURN 2o 
ELSE 
Ep 0a N41 << DITFFT2 yn /2 (Lon, ) 
Ok.=0,---,N/2—1 <— DITFFT2 y/2 (Zon) 
FOR k=0 to N/2-1 
Xp < EB, + ne O; 
Xhinj2 <— Ey, — why Ox 
END FOR 
RETURN X; 
ENDIF 
DITFFT2 (tn) 


Radix-2 


A recursive depth first implementation of the Cooley-Tukey radix-2 
decimation-in-time (DIT) FFT is described with pseudocode in [link], and 
an implementation coded in C with only the most basic optimization — 
avoiding multiply operations where o. is unity in the first iteration of the 
loop — is included in Appendix 1. Even when compiled with a state-of-the- 
art auto-vectorizing compiler,[footnote] the code achieves poor 
performance on modern microprocessors, and is useful only as a baseline 
reference.| footnote] 

Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, 
Version 12.1.0.038 Build 20110811. 

Benchmark methods contains a full account of the benchmark methods. 


Implementation Machine Runtime 


Danielson-Lanczos, 1942 Human 140 
[link] minutes 


Cooley-Tukey, 1965 [link] IBM 7094 ~ 10.5 ms 


Listing 1, Appendix 1, 2011 aaa a 


~ 440 ps 
Performance of simple radix-2 FFT from a historical perspective, for size 
64 real FFT 


However it is worth noting that when considered from a historical 
perspective, the performance does seem impressive — as shown in [link]. 
The runtimes in [link] are approximate; the Cooley-Tukey figure is roughly 
extrapolated from the floating point operations per second (FLOPS) count 
of a size 2048 complex transform given in their 1965 paper [link]; and the 
speed of the reference implementation is derived from the runtime of a size 
64 complex FFT (again, based on the FLOPS count). Furthermore, the 
precision differs between the results; Danielson and Lanczos computed the 
DFT to 3-5 significant figures (possibly with the aid of slide rules or adding 
machines), while the other results were computed with the host machines' 
implementation of single precision floating point arithmetic. 


The runtime performance of the FFT has improved by about seven orders of 
magnitude in 70 years, and this can largely be attributed to the computing 
machines of the day being generally faster. The following sections and 
chapters will show that the performance can be further improved by over 
two orders of magnitude if the algorithm is enhanced with optimizations 
that are amenable to the underlying machine. 


Split-radix 
IF N= 1 
RETURN Zo 


ELSIF N = 2 
Xo + @ + 2 


Xj — Tp — Wi 
ELSE 
Op =0,-++,N/2—1 <— SPLITFFT y/2 (Lan, ) 
Zks=0,---,N/4—1 = SPLITFFT y /4 (Lindt) 
hy—0,---,.N ja < SPLITFFT v/a (£4n,+3) 
FOR k=0 to N/4-1 
X;, <— Un + (why Z, + w% Zi) 
Xiinj2 <— Up — (wy Zp + wh Zi) 
Xiinja < Unpwjs — t (why Ze — wi Zi) 
Xpi3an/4 — Upenja + i (wy Ze — wih Z;) 
END FOR 
ENDIF 
RETURN XX; 
SPLITFFT y (Xn) 


As was the case with the radix-2 FFT in the previous section, the split-radix 
FFT neatly maps from the system of linear functions to the pseudocode of 
[link], and then to the C implementation included in Appendix 1. 


[link] explicitly handles the base case for N = 2, to accommodate not only 
size 2 transforms, but also size 4 and size 8 transforms (and all larger 
transforms that are ultimately composed of these smaller transforms). A 
size 4 transform is divided into two size 1 sub-transforms and one size 2 
transform, which cannot be further divided by the split-radix algorithm, and 
so it must be handled as a base case. Likewise with the size 8 transform that 
divides into one size 4 sub-transform and two size 2 sub-transforms: the 
size 2 sub-transforms cannot be further decomposed with the split-radix 
algorithm. 


Also note that two twiddle factors, viz. wk and wk, are required for the 
split-radix decomposition; this is an advantage compared to the radix-2 
decomposition which would require four twiddle factors for the same size 4 
transform. 


Conjugate-pair 


From a pseudocode perspective, there is little difference between the 
ordinary split-radix algorithm and the conjugate-pair algorithm (see [link]). 
In line 10, the %4,,43 terms have been shifted cyclically by —4 to %4n,-1, 
and in lines 12-15, the coefficient of Z; has been shifted cyclically from 


w* to wy". 
IF N = 1 
RETURN Zo 


ELSIF N = 2 
Xo <- 9 + Z 
Xi — Zo — Zi 


Oky=0,---,N/2-1 <— CONJFFT y/o (Lan, ) 
Zky=0,---,N/4—1 — CONJFFT yy /4 (LAn,+1) 
ka=0,---,N/4—1 = CONJFFT wy /4 (Lan,—1) 
FOR k=0 to N/4-1 
Xp © Ue + (why Ze + wet 2 
Xppwj2 UR — (wk Z, + wat Zi.) 
Xpinja  Uninya — i (wh Ze — wy* Zj) 
Xpisw/a < Uniwja + i (wk, Z, — wy" Z;) 
END FOR 
ENDIF 


RETURN X;, 
CONJFFT y (2,,) 


The source code has a few subtle differences that are not revealed in the 
pseudocode. The pseudocode in [link] requires an array of complex 
numbers x,, for input, but the source code requires a reference to an array of 
complex numbers with a stride[footnote] — this avoids copying x, into three 
separate alrays, VIZ. Zan,, Lan,+1 aNd L4n,—-1, with every invocation of 
[link]. The subtle complication arises due to the cyclic shifting of the 
L4n,—1 term; the negative shifting results in pointers that reference data 


before the start of the array. Rather than immediately wrapping the 
references around to end of the array such that they always point to valid 
data, the recursion proceeds until the base cases are reached before any 
adjustment is performed. Once at the leaves of the recursion, any pointers 
that reference data lying before the start of the input array are incremented 
by N elements,[footnote] so as to point to the correct data. 

A stride of n would indicate that only every n‘” term is being referred to. 
In this case, NV refers to the size of the outer most transform rather than the 
size of the sub-transform. 


IF N = 1 
RETURN Zo 
ELSIF N = 2 


Ux,=0,-,N/2-1 <- TANGENTFFT4 y//2 (x2n,) 

Zh =0,: -N/4-1 <—— TANGENTFFTS8 jy /4 (La4n,+1) 
Ig=0,---N/4-1 <— TANGENTFFT8 w/4 (@4n,-1) 

FOR k=0 to N/4-1 
Xp < Up, + (why SN/4,k Ze + ae SN/4,k Zi) 


Xk4N/2 <+ Ur, —- (why SN/4,k Ze + wa SN/4,k Zi) 
Xpin/a4 — URenja — i (why SN/4,k Zk — win SN/4,k Zi) 


Xpigwia  Unsnja + i (why sujan Ze — wy” $n/4,n Zp) 
END FOR 
ENDIF 
RETURN X;, 
TANGENTFFT4 y (z,) 


IF N = 1 
RETURN Zo 
ELSIF N = 2 


Tk =0,1 <- TANGENTFFT82 (22n, ) 
<~ TANGENTFFTS83 (22n,+1) 

Xo << To = i 

Xo <— Th — i 

Xj — T; + i 

X3; ¢« T, — T! 


U;,,=0,- .N/4-1 = TANGENTFFTS8 jy /4 (Lan, ) 
Veen: -N/8—1 << TANGENTFFTS8 yy, 7g (Lsn_+2 
Ypn=0,.-N/8-1 < TANGENTFFTS8 y//g (%8n3—2 
Zig—0,-,N/4-1 <- TANGENTFFT8 yy /4 (24n4+1 
Z.4-0,--,.N/4-1 <- TANGENTFFTS yy /4 ( 
FOR k=0 to N/8-1 

OQN,k *— SN/4k / 8N,k 

Bvt — Sn/sk / SN/2,k 

YN,k *— §N/4,k4+N/8 / SN, k+N/8 

ONk Sy /2,4/8N,k 

EN,k <— SN/2,.k+N/8/SN,k+N/8 

Qo — wh * ane 

Q +H a * Nk 

QQy <— wer ‘i PN,k 


To < (22 Ye + 22 Vx )*5n, 

Ti © i(M Ye — 22 Yi )*enp 

X, <— U,*an~ + To + (2 Zp + Qo Zi) 
Xpin/2 < Up*anjk +710 — (2% Zr + Qo Zi.) 


) 
) 
) 
) 


LAny— 1 


Xpiw/4 < Up* ann, — Tyo — i (2 Ze — No Zi.) 
Xp43n/4 <— Up*an~ — To + i (2 Zp — 2X Zi) 


Xpiw/s << Unrnys*ynjk — T1 + (21 Zerw/s + Qo Z,,, 
Xpy3n/s  Unsnys*ynz + Ti — i (2 Ze+n/g — Go Z 
Xpysn/s  VUnrns*ynzk — Ti — (2 Zetn/g + Qo Z, 


Xk+7n/8 * Unrnys*ynn + Ti + i (2 Ze+n/g — Qo Z 


END FOR 
ENDIF 
RETURN X* 
TANGENTFFTS8 y (xn) 


Tangent 


The tangent FFT is divided into two functions, described with pseudocode 
in [link] and [link]. If the tangent FFT is computed prior to convolution in 
the frequency domain, the convolution kernel can absorb the final scaling 


and only [link] is required. Otherwise [link] is used as a wrapper around 
[link] to perform the rescaling, and the result X;, is in the correct basis. 


[link] is similar to [link], except that Z, and Z : are computed with [link], 
and thus scaled by 1/sy /4,k- Because Z, and Z ;, are respectively multiplied 
by the coefficients w*, and ne , the results are scaled into the correct basis 
by absorbing $74, into the coefficients. 


[link] is almost a 1:1 mapping of the system of linear equations, except that 
the base cases of N = 1, 2, 4 are handled explicitly. In [link], the case of 
N = 4 is handled with two size 2 base cases, which are combined into a 
size 4 FFT. 


Putting it all together 
1.8 
® Tangent 
1.6 @ Cooley-Tukey 
® Conjugate-pair 


® Solit-radix 


Speed (GFLOPS) 
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Speed of simple FFT implementations 


The simple implementations covered in this section were benchmarked for 
sizes of transforms 2? through to 218 running on a Macbook Air 4,2 and the 
results are plotted in [link]. The speed of each transform is measured in 
Cooley-Tukey gigaflops (CTGs), where a higher measurement indicates a 
faster transform.| footnote] 

CTGs are an inverse time measurement. See Benchmark methods for a full 
explanation of the benchmarking methods. 


It can be seen from [link] that although the conjugate-pair and split-radix 
algorithms have exactly the same FLOP count, the conjugate-pair algorithm 
is substantially faster. The difference in speed can be attributed to the fact 
that the conjugate-pair algorithm requires only one twiddle factor per size 4 
sub-transform, whereas the ordinary split-radix algorithm requires two. 


Though the tangent FFT requires the same number of twiddle factors but 
uses fewer FLOPs compared to the conjugate-pair algorithm, its 
performance is worse than the radix-2 FFT for most sizes of transform, and 
this can be attributed to the cost of computing the scaling factors. 


A simple analysis with a profiling tool reveals that each implementations’ 
runtime is dominated by the time taken to compute the coefficients. Even in 
the case of the conjugate-pair algorithm, over 55% of the runtime is spent 
calculating the complex exponential function. Eliminating this performance 
bottleneck is the topic of the next section. 


Precomputed coefficients 


The speed of [link] — [link] may be dramatically improved if the 
coefficients are precomputed and stored in a lookup table (LUT). 


When computing an FFT of size N, [link] requires N/2 different twiddle 
factors that correspond to N’/2 samples of a half rotation around the 
complex plane. Rather than storing N’/2 complex numbers, the symmetries 
of the sine and cosine waves that compose wr may be exploited to reduce 
the storage to N/4 real numbers — a 75% reduction in memory — by storing 
only one quadrant of a sine or cosine wave from which the real and 
imaginary parts of any twiddle factor can be constructed. Such a scheme 


has advantages in hardware implementations where LUT memory is a 
costly resource [link], but for modern microprocessor implementations of 
the FFT, it is more advantageous to have a less complex indexing scheme 
and better memory locality, rather than a smaller LUT. 


As already mentioned, each transform of size N that is computed with 


[link] requires N/2 twiddle factors from wh through to wr! 7 but the two 


sub-transforms of [link] require twiddle factors ranging from way 2 through 
to ans The twiddle factors of the sub-transforms can be obtained by 
downsampling the parent transform's twiddle factors by a factor of 2, and 
because the downsampling factors are all powers of 2, simple shift 
operations can be used to index any twiddle factor anywhere in the 
transform from one LUT. 


Appendix 2 contains listings of source code that augment each of the simple 
implementations from the previous section with LUTs of precomputed 
coefficients. The modifications are fairly minor: each implementation now 
has an initialization function that populates the LUT(s) based on the size of 
the transform to be computed, and each transform function now has a 
parameter of logs (stride), so as to economically index the twiddle factors 
with little computation. 
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Speed of FFTs with precomputed coefficients 


As [link] shows, the speedup resulting from the precomputed twiddle LUT 
is dramatic — sometimes more than a factor of 6 (cf. [link]). Interestingly, 
the ordinary split-radix algorithm is now faster than the conjugate-pair 
algorithm, and inspection of the compiler output shows that this is due to 
the more complicated addressing scheme at the leaves of the computation, 
and because the compiler lacks good heuristics for complex multiplication 
by a conjugate. The performance of the tangent FFT is hampered by the 
same problem, yet the tangent FFT has better performance, which can be 
attributed to the tangent FFT having larger straight line blocks of code at 
the leaves of the computation (the tangent FFT has leaves of size 4, while 
the split-radix and conjugate-pair FFTs have leaves of size 2). 


Single instruction, multiple data 


The performance of the programs in the previous section may be further 
improved by explicitly describing the computation with SIMD intrinsics. 
Auto-vectorizing compilers, such as the Intel C compiler used to compile 
the previous examples, can extract some data-level parallelism and generate 


SIMD code from a scalar description of a computation, but better results 
can be obtained when using vector intrinsics to explicitly specify the 
parallel computation. 


Intrinsics are an alternative to inline assembly code when the compiler fails 
to meet performance constraints. In most cases an intrinsic function directly 
maps to a single instruction on the underlying machine, and so intrinsics 
provide many of the advantages of inline assembler code. But in contrast to 
inline assembler code, the compiler uses its detailed knowledge of the 
intrinsic semantics to provide better optimizations and handle tasks such as 
register allocation. 


Almost all desktop and handheld machines now have processors that 
implement some sort of SIMD extension to the instruction set. All major 
Intel processors since the Pentium III have implemented SSE, an extension 
to the x86 architecture that introduced 4-way single precision floating point 
computation with a new register file consisting of eight 128-bit SIMD 
registers — known as XMM registers. The AMD64 architecture doubled the 
number of XMM registers to 16, and Intel followed by implementing 16 
XMM registers in the Intel 64 architecture. SSE has since been expanded 
with support for other data types and new instructions with the introduction 
of SSE2, SSE3, SSSE3 and SSE4. Most notably, SSE2 introduced support 
for double precision floating point arithmetic and thus Intel's first attempt at 
SIMD extensions, MMX, was effectively deprecated. Intel's recent 
introduction of the sandybridge micro-architecture heralded the first 
implementation of AVX — a major upgrade to SSE that doubled the size of 
XMM registers to 256 bits (and renamed them YMM registers), enabling 8- 
way Single precision and 4-way double precision floating point arithmetic. 


Another notable example of SIMD extensions implemented in commodity 
microprocessors is the NEON extension to the ARMv7 architecture. The 
Cortex family of processors that implement ARMv7 are widely used in 
mobile, handheld and tablet computing devices such as the iPad, iPhone and 
Canon PowerShot A470, and the NEON extensions provide these 
embedded devices with the performance required for processing audio and 
video codecs as well as graphics and gaming workloads. 


Compared to SSE and AVX, NEON has some subtle differences that can 
greatly improve performance if used properly. First, it has dual length 
SIMD vectors that are aliased over the same registers; a pair of 64-bit 
registers refers to the lower and upper half of one 128-bit register — in 
contrast, the AVX extension increases the size of SSE registers to 256-bit, 
but the SSE registers are only aliased over the lower half of the AVX 
registers. Second, NEON can interleave and de-interleave data during 
vector load or store operations, for up to four vectors of four elements 
interleaved together. In the context of FFTs, the interleaving/de-interleaving 
instructions can be used to reduce or eliminate vector permutations or 
shuffles. 


Split format vs. interleaved format 


In the previous examples, the data was stored in interleaved format (i.e., the 
real and imaginary parts composing each element of complex data are 
stored adjacently in memory), but operating on the data in split format (i.e., 
the real parts of each element are stored in one contiguous array, while the 
imaginary parts of each element are stored contiguously in another array) 
can simplify the computation when using SIMD. The case of complex 
multiplication illustrates this point. 


static inline __m128 MUL_INTERLEAVED(__m128 a, 
_mi28 b) { 
__mi28 re, im; 


re = _mm_shuffle_ps(a,a,_MM_SHUFFLE(2,2,0,0)); 

re = _mm_mul_ps(re, b); 

im = _mm_shuffle_ps(a,a,_MM_SHUFFLE(3,3,1,1)); 

b = _mm_shuffle_ps(b,b, MM_SHUFFLE(2,3,0,1)); 

im = _mm_mul_ps(im, b); 

im = _mm_xor_ps(im, _mm_set_ps(0.0f, -0.0f, 0. 
Of, -0.0f)); 


return _mm_add_ps(re, im); 


i 


SSE multiplication with interleaved complex data 


Interleaved format complex multiplication 


The function in [link] takes complex data in two 4-way single precision 
SSE registers (a and b) and performs complex multiplication, returning the 
result in a single precision SSE register. The SSE intrinsic functions are 
prefixed with “_mm_', and the SSE data type corresponding to a single 128- 
bit single precision register is ~__m128'. 


When operating with interleaved data, each SSE register contains two 
complex numbers. Two shuffle operations at lines 3 and 5 are used to 
replicate the real and imaginary parts (respectively) of the two complex 
numbers in input a. At line 4, the real and imaginary parts of the two 
complex numbers in b are each multiplied with the real parts of the complex 
numbers in a. A third shuffle is used to swap the real and imaginary parts of 
the complex numbers in b, before being multiplied with the imaginary parts 
of the complex numbers in a — and the exclusive or operation at line 8 is 
used to selectively negate the sign of the real parts in this result. Finally, the 
two intermediate results stored in the re and im registers are added. In total, 
seven SSE instructions are used to multiply two pairs of single precision 
complex numbers. 


Split format complex multiplication 


typedef struct _reg_t { 
__mi28 re, im; 
$} reg_t; 


static inline reg_t MUL_SPLIT(reg_t a, reg_t b) 

{ 

reg_t r; 

r.re = _mm_sub_ps(_mm_mul_ps(a.re,b.re),_mm_mu 
1_ps(a.im,b.im)); 

r.im = _mm_add_ps(_mm_mul_ps(a.re,b.im),_mm_mu 
1_ps(a.im,b.re)); 

return r; 


SSE multiplication with split complex data 


The function in [link] takes complex data in two structs of SSE registers, 
performs the complex multiplication of each element of the vectors, and 
returns the result in a struct of SSE registers. Each struct is composed of a 
register containing the real parts of four complex numbers, and another 
register containing the imaginary parts — so the function in [link] is 
effectively operating on vectors twice as long as the function in [link]. The 
benefit of operating in split format is obvious: the shuffle operations that 
were required in [link] are avoided because the real and imaginary parts can 
be implicitly swapped at the instruction level, rather than by awkwardly 
manipulating SIMD registers at the data level of abstraction. Thus, [link] 
computes complex multiplication for vectors twice as long while using one 
less SSE instruction — not to mention other advantages such as reducing 
chains of dependent instructions. The only disadvantage to the split format 
approach is that twice as many registers are needed to compute a given 
operation — this might preclude the use of a larger radix or force register 
paging for some kernels of computation. 


Fast interleaved format complex multiplication 


[link] is fast method of interleaved complex multiplication that may be used 
in situations where one of the operands can be unpacked prior to 
multiplication — in such cases the instruction count is reduced from 7 
instructions to 4 instructions (cf. [link]). This method of complex 
multiplication lends itself especially well to the conjugate-pair algorithm 
where the same twiddle factor is used twice — by doubling the size of the 
twiddle factor LUT, the multiplication instruction count is reduced from 14 
instructions to 8 instructions. Furthermore, large chains of dependent 
instructions are reduced, and in practice the actual performance gain can be 
quite impressive. 


Operand a in [link] has been replaced with two operands in [link]: re and im 
— these operands have been unpacked, as was done in lines 3 and 5 of [link]. 


Furthermore, line 8 of [link] is also avoided by performing the selective 
negation during unpacking. 


static inline __mi28 
MUL_UNPACKED_INTERLEAVED(__ m1i28 re, __m128 im, 


_mi28 b) { 
re = _mm_mul_ps(re, b); 
im = _mm_mul_ps(im, b); 
im = _mm_shuffle_ps(im, im, MM_SHUFFLE(2,3,0,1) 


i 


return _mm_add_ps(re, im); 


SSE multiplication with partially unpacked 
interleaved data 


Vectorized loops 


The performance of the FFTs in the previous sections can be increased by 
explicitly vectorizing the loops. The Macbook Air 4,2 used to compile and 
run the previous examples has a CPU that implements SSE and AVX, but 
for the purposes of simplicity, SSE intrinsics are used in the following 
examples. The loop of the radix-2 implementation is used as an example in 
[link]. 


for(k=0;k<N/2;k++) { 
data_t Ek = out[k]; 
data_t Ok = out[(k+N/2)]; 
data_t w = LUT[k<<log2stride]; 
out[k] = Ek + w * Ok; 
out[(k+N/2) |] = Ek - w * Ok; 
i 
Inner loop of radix-2 Cooley-Tukey FFT 


Each iteration of the loop in [link] accesses two elements of complex data 
in the array out, and one complex element from the twiddle factor LUT. 
Over multiple iterations of the loop, out is accessed contiguously in two 


places, but the LUT is accessed with a non-unit stride in all sub-transforms 
except the outer transform. Some vector machines can perform what are 
known as vector scatter or gather memory operations — where a vector 
gather could be used in this case to gather elements from the LUT that are 
separated by a stride. But SSE only supports contiguous or streaming access 
to memory. Thus, to efficiently compute multiple iterations of the loop in 
parallel, the twiddle factor LUT is replaced with an array of LUTs — each 
corresponding to a sub-transform of a particular size. In this way, all 
memory accesses for the parallelized loop are contiguous and no memory 
bandwidth is wasted. 


for (k=0;k<N/2;k+=4) { 

__mi28 Ok_re = _mm_load_ps((float *)&out[k+N/2 
]); 

__mi28 Ok_im = _mm_load_ps((float *)&out[k+N/2 
+2]); 

__mi28 w_re = _mm_load_ps((float *)&LUT[log2st 
ride|[k]); 

__mi28 w_im = _mm_load_ps((float *)&LUT[log2st 
ride][k+2]); 


__mi28 Ek_re = _mm_load_ps((float *)&out[k]); 
__mi28 Ek_im = _mm_load_ps((float *)&out[k+2]) 
__mi28 wOk_re = 


_mm_sub_ps(_mm_mul_ps(Ok_re,w_re),_mm_mul_ps 
(Ok_im,w_im)); 
__mi28 wOk_im = 
_mm_add_ps(_mm_mul_ps(Ok_re,w_im),_mm_mul_ps 
(Ok_im,w_re)); 
_mm_store_ps((float *) 
(out+k), _mm_add_ps(Ek_re, wOk_re)); 
_mm_store_ps((float *) 
(out+k+2), _mm_add_ps(Ek_im, wOk_im)); 
_mm_store_ps((float *) 
(out+k+N/2), _mm_sub_ps(Ek_re, wOk_re)); 
_mm_store_ps((float *) 
(out+k+N/2+2), _mm_sub_ps(Ek_im, wOk_im)); 


Vectorized inner loop of Cooley-Tukey radix-2 FFT 


[link] computes the loop of [link] using split format data and a vector length 
of four (i.e., it computes four iterations at once). Note that the vector load 
and store operations used in [link] require that the memory accesses are 16- 
byte aligned — this is a fairly standard proviso for vector memory 
operations, and use of the correct memory alignment attributes and/or 
memory allocation routines ensures that memory is always correctly 
aligned. 


Some FFT libraries require the input to be in split format (i.e., the real parts 
of each element are stored in one contiguous array, while the imaginary 
parts are stored contiguously in another array) for the purposes of 
simplifying the computation, but this conflicts with many other libraries and 
use cases of the FFT — for example, Apple's vDSP library operates in split 
format, but many examples require the use of un-zip/zip functions on the 
input/output data (see Usage Case 2: Fast Fourier Transforms in [link]). A 
compromise is to convert interleaved format data to split format on the first 
pass of the FFT, computing most of the FFT with split format sub- 
transforms, and converting the data back to interleaved format as it is 
processed on the last pass. 
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Speed of FFTs with vectorized loops 


Appendix 3 contains listings of FFTs with vectorized loops. The input and 
output of the FFTs is in interleaved format, but the computation of the inner 
loops is performed on split format data. At the leaves of the transform there 
are no loops, so the computation falls back to scalar arithmetic. 


[link] summarizes the performance of the listings in Appendix 3. 
Interestingly, the radix-2 FFT is faster than both the conjugate-pair and 
ordinary split-radix algorithms until size 4096 transforms, and this is due to 
the conjugate-pair and split-radix algorithms being more complicated at the 
leaves of the computation. The radix-2 algorithm only has to deal with one 
size of sub-transform at the leaves, but the split-radix algorithms have to 
handle special cases for two sizes, and furthermore, a larger proportion of 
the computation takes place at the leaves with the split-radix algorithms. 
The conjugate-pair algorithm is again slower than the ordinary split-radix 
algorithm, which can (again) be attributed to the compiler's relatively 
degenerate code output when computing complex multiplication with a 
conjugate. 


Overall, performance improves with the use of explicit vector parallelism, 
but still falls short of the state of the art. The next section characterizes the 
remaining performance bottlenecks. 


The performance bottleneck 


» vector output » scalar output » Scalar input 
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Memory access pattern of straight line blocks of code in a 
size 64 radix-2 FFT 


The memory access patterns of an FFT are the biggest obstacle to 
performance on modern microprocessors. To illustrate this point, [link] 
visualizes the memory accesses of each straight line block of code in a size 
64 radix-2 DIT FFT (the source code of which is provided in Appendix 3). 


The vertical axis of [link] is memory. Because the diagram depicts a size 64 
transform there are 64 rows, each corresponding to a complex word in 
memory. Because the transform is out-of-place, there are input and output 
arrays for the data. The input array contains the data “in time”, while the 
output array contains the result “in frequency”. Rather than show 128 rows 
— 64 for the input and 64 for the output — the input array's address space has 
been aliased over the output array's address space, where the orange code 
indicates an access to the input array and the green and blue codes for 
accesses to the output array. 


Each column along the horizontal axis represents the memory accesses 
sampled at each kernel (i.e., butterfly) of the computation, which are all 
straight line blocks of code. The first column shows two orange and one 
blue memory operations, and these correspond to a radix-2 computation at 
the leaves reading two elements from the input data, and writing two 
elements into the output array. The second column shows a similar radix-2 
computation at the leaves: two elements of data are read from the input at 
addresses 18 and 48, the size 2 DFT computed, and the results written to the 
output array at addresses 2 and 3. 


There are columns that do not indicate accesses to the input array, and these 
are the blocks that are not at the leaves of the computation. They load data 
from some locations in the output, performing the computation, and store 
the data back to the same locations in the output array. 


There are two problems that [link] illustrates. The first is that the accesses 
to the input array — the samples “in time" — are indeed very decimated, as 
might be expected with a decimation-in-time algorithm. Second, it can be 
observed that the leaves of the computation are rather inefficient, because 
there are large numbers of straight line blocks of code performing scalar 
memory accesses, and no loops of more than a few iterations (i.e., the 
leaves of the computation are not taking advantage of the machine's SIMD 
capability). 


[link] in the previous section showed that the vectorized radix-2 FFT was 
faster than the split-radix algorithms up to size 4096 transforms; a 
comparison between [link] and [link] helps explain this phenomenon. The 
split-radix algorithm spends more time computing the leaves of the 


computation (blue), so despite the split-radix algorithms being more 
efficient in the inner loops of SIMD computation, the performance has been 
held back by higher proportion of very small straight line blocks of code 
(corresponding to sub-transforms smaller than size 4) performing scalar 
memory accesses at the leaves of the computation. 


Because the addresses of memory operations at the leaves are a function of 
variables passed on the stack, it is very difficult for a hardware prefetch unit 
to keep these leaves supplied with data, and thus memory latency becomes 
an issue. In later chapters, it is shown that increasing the size of the base 
cases at the leaves improves performance. 
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Memory access pattern of straight line blocks of 
code in a size 64 split-radix FFT 


Existing Libraries 


Owing to the importance of efficiently computing FFTs in signal processing 
and other areas, there have been many implementations for 
microprocessors; FFTW's benchmark software, for example, includes a 
collection of 25 different FFT implementations. However, of the many 
implementations, only a few have competed with the state of the art over 
the last fifteen years. Since its first release in 1997, FFTW has risen to 
become one of the most well known fast Fourier transform libraries. Other 
libraries reviewed in this chapter are SPIRAL, UHFFT, djbfft, Apple vDSP, 
MatrixFFT, and Intel IPP. 


The “Fastest Fourier transform in the west” (FFTW) 


FFTW [link], [link], [link] is an implementation of the DFT that attempts to 
automatically adapt to the hardware in order to maximise performance, and 
its development in 1997 was predicated on the idea that it had become too 
complicated to optimize the performance of the fast Fourier transform for 
modern microprocessors. 


The latest release of FFTW, version 3.3, generates a library of over 150 
“codelets” at compile time. The codelets are fragments of machine- 
independent straight-line code derived from DFT algorithms, including the 
Cooley-Tukey [link] algorithm and its derivatives the split-radix [link], 
[link], conjugate-pair [link], [link] and mixed-radix algorithms. Radar [link] 
and Bluestein [link], [link], [link] algorithms are used for sizes that are 
prime, and the prime-factor algorithm [link], [link] for sizes that are 
factored by co-primes. At runtime, a plan for a specific problem, e.g., 1024 
point 1D forward double precision out-of-place DFT, is generated by 
searching the huge space of possible codelet configurations for the best 
solution. 


The codelet generator operates in four phases: creation, simplification, 
scheduling, and unparsing (code generation). During creation, the codelet 
generator produces a representation of the computation in the form of a 
DAG. The DAG is expressed in terms of complex numbers [link], and can 
be viewed as a linear network [link]. In the simplification stage, algebraic 


transformations and common subexpression elimination rewriting 

rules [link] are applied to each node of the DAG, which is then 
topologically sorted to produce a schedule. In a 2008 paper [link], Johnson 
and Frigo contend that “the compiler needs help with such long blocks of 
code", and an earlier paper from 1999 [link] is cited to support the 
hypothesis that compilers are not capable of efficiently allocating registers 
and scheduling code for hard-coded blocks of about size 64, which 
compares an earlier version of FFTW compiled with an older 
compiler[footnote] to an FFT from Sun's Performance Library. There is no 
mention of re-testing the aforementioned hypothesis with more advanced 
compilers. 

Sun WorkShop Compilers 4.2 30 Oct 1996 C 4.2 


FFTW has several modes available for searching the configuration space of 
codelets. In “patient” mode, FFTW uses dynamic programming to evaluate 
the runtime of almost all combinations of possible plans. As the runtime of 
many sub-problems is repeatedly evaluated while searching the 
configuration space, the results of locally optimized sub-problems are 
cached, reducing runtime of the planner while producing results very close 
to that of an exhaustive search. 


In “estimate” mode, FFTW minimizes a heuristic cost that is a function of a 
particular configuration's count of floating point operations and extraneous 
memory operations (for buffering and transposes). Compared to patient 
mode, the runtime of the planner is reduced by several orders of magnitude, 
at the expense of runtime performance while executing the plan. For 
executing plans of 1D complex transforms on a PowerPC G5, the median 
and peak difference in runtime performance between patient and estimate 
modes was 20% and 72%, respectively. This result is used by Frigo and 
Johnson to support the hypothesis that there is no longer any correlation 
between operation counts and runtime performance on modern 

machines [link]. 


Frigo and Johnson discuss a small number of planner solutions in their 2005 
paper on the design of FFTW3 [link], and conclude that “we do not really 
understand the planner's choices because we cannot predict what plans will 
be produced. Indeed, this is the whole point of implementing a planner.” 


They do not mention the use of more rigorous methods, such as machine 
learning, for the purposes of predicting performance. 


FFTW supports computation of complex DFTs with SIMD extensions by 
means of two-way parallel computation of real DFTs [link]. The Vienna 
MAP vectorizer [link], [link], [link] has also been coupled with FFTW to 
produce a high-performance FFT library for the IBM Blue Gene/L 
supercomputer [link] that is up to 80% faster than the best-performing 
scalar FFT codes generated by FFTW [link]. 


Daniel Bernstein's FFT (djbfft) 


In 1997, Daniel Bernstein noticed that it was not difficult to write code that 
out-performed FFTW [link]. He had written 86 lines of unscheduled code 
that computed a size 256 single precision transform in about 35000 Pentium 
cycles — faster than FFTW. After spending a few more days doing “some 
casual instruction scheduling,” he could compute the same transform with 
about 24000 Pentium cycles (ibid.). 


These performance results directly contradicted the assumption that 
predicated FFTW: that it was too hard to predict the performance of FFT 
code on modern microprocessors. Development of djbfft continued until 
1999, and it had succeeded in becoming the fastest library for computing 
FFTs on most Pentium and SPARC machines. 


Bernstein's FFT is notable for having been the first publicly available 
library to exploit the advantages of the conjugate-pair or “-1 exponent” 
algorithm. After Bernstein demonstrated the advantages of the algorithm in 
djbfft, Frigo and Johnson followed with an implementation in FFTW [link]. 


SPIRAL 


SPIRAL [link], [link], [link] attempts to automatically optimize code for 
signal processing functions such as the discrete Fourier transform. 
SPIRAL's goal is to automatically optimize signal processing functions at 
the push of a button, with results that are as good as hand-optimized codes. 


In contrast to FFTW, SPIRAL's optimization is performed at compile time, 
and thus the generated code is less portable. Another point of difference is 
in the search methods: while FFTW uses dynamic programming, SPIRAL 
uses a wide range of techniques that include machine learning [link], [link]. 


Franchetti and Puschel argue that vectorization is best performed at the 
algorithm level of abstraction by manipulating Kronecker product 
expressions through mathematical identities [link], and this is the basis for a 
rewriting system [link] that vectorizes for short vector machines [link], 
[link], [link]. 


In [link], SPIRAL is slower than FFTW 3.1 for 2-way double-precision 
power of two transforms, but SPIRAL is fastest for 4-way single-precision 
power of two transforms where 16 < n < 128. SPIRAL generates code 
that is characterized by large basic blocks and single-threaded performance 
does not scale beyond sizes of about 4096 points. Indeed, source code is 
only publicly available for sizes 2 through to 8192 points [link]. 


UHFFT 


UHFFT [link], [link], [link], [link], [link], like FFTW, generates a library of 
codelets which are assembled into transforms by a planner. The planner 
uses dynamic programming to search an exponential space of possible 
algorithms, factors and schedules, relying on codelet timings to predict 
transform execution times [link]. 


UHFFT uses the mixed-radix and split-radix [link], [link] algorithms for 
power of two sizes, the prime-factor algorithm [link], [link] for sizes that 
are factored by co-primes, and the Radar [link] algorithm for sizes that are 
prime. 


UHFFT generates a schedule from a DAG which has been topologically 
sorted, mainly to optimize memory reuse distance [link]. The schedule is 
then unparsed to C code. 


Scalar results on Itanium2 and Opteron show that UHFFT's dynamic 
programming approach can choose a plan having performance within 10% 


of the actual optimal plan. For power of two sizes, UHFFT's performance 
was typically worse than FFTW or Intel MKL, while UHFFT was faster 
than FFTW for prime-factor and prime sizes (ibid.). 


Intel Integrated Performance Primitives (IPP) 


Of the closed source FFT implementations, the IPP library [link] provides 
the best results for most sizes of DFT on machines with Intel processors. 
IPP includes a number of different FFT implementations that appear to be 
hand optimized for different machine configurations, and in contrast to 
FFTW, IPP deterministically chooses the best code to run based on the 
capabilities of the machine and the OS — achieving results that are typically 
superior to FFTW. 


Because IPP is closed source, there is no publicly available information 
regarding the algorithms and techniques used. 


Apple vDSP 


The Apple Accelerate libraries contain a wide range of computationally 
intensive functions that have been optimized for vector computation on 
PowerPC, x86 and ARM architectures. Within the Accelerate library, vDSP 
is a collection of DSP functions that includes the FFT. 


The vDSP implementation of the FFT is distinctive among the other 
libraries reviewed in this chapter in that it only operates on data that is 
stored in split format (where the real and imaginary parts of complex 
numbers are stored in separate arrays). However, many applications have 
data that is already in interleaved format (where the real and imaginary part 
of each complex number are stored adjacent in memory), or require data in 
interleaved format, and so vDSP provides un-zip/zip functions for 
converting data to/from split format. 


The Apple vDSP library is notable for having very good FFT performance 
on ARM NEON devices, while its x86 performance is average (comparable 
with FFTW “estimate” mode performance). 


As with IPP, vDSP is only distributed in binary form and thus little can be 
said about the algorithms and techniques employed. 


MatrixFFT 


MatrixFFT is a library for efficiently computing large transforms of more 
than 2'8 points on Apple hardware, with sustained processing rates 
reportedly being as high as 40 CTGs for very large single precision 
transforms. Large scale FFTs have been used in areas such as image 
processing (with images of over 10° pixels) and experimental mathematics 
(for extreme-precision computation of 77). 


MatrixFFT uses the four-step algorithm to decompose a transform into 
smaller sub-transforms that fit in the cache [link], and computes the smaller 
sub-transforms with Apple vDSP. Interestingly, MatrixFFT has better 
performance — in many cases — while using interleaved format to store the 
data, even though the interleaved format must be converted to split format 
before using vDSP [link]. 


MatrixFFT includes a calibration utility that evaluates the various 
implementation parameters for each size of transform on a given machine, 
which can then be used to re-compile the library so that it achieves best 
performance on that particular machine. 


MatrixFFT is freely available and distributed in source code form by 
Apple [link]. 


Streaming FFT 


This chapter describes SFFT: a high-performance FFT library for SIMD 
microprocessors that is, in many cases, faster than the state of the art FFT 
libraries reviewed in Existing libraries. 


Implementation details described some simple implementations of the FFT and 
concluded with an analysis of the performance bottlenecks. The 
implementations presented in this chapter are designed to improve spatial 
locality, and utilize larger straight line blocks of code at the leaves, 
corresponding to sub-transforms of sizes 8 through to 64, in order to reduce 
latency and stack overheads. 


In distinct contrast to the simple FFT programs of Chapter 3, this chapter 
employs meta-programming. Rather than describe FFT programs, we describe 
programs that statically elaborate the FFT into a DAG of nodes representing 
the computation, apply some optimizing transformations to the graph, and then 
generate code. Many other auto-vectorization techniques, such as those 
employed by SPIRAL, operate at the instruction level [link], but the techniques 
presented in this chapter vectorize blocks of computation at the algorithm level 
of abstraction, thus enabling some of the algorithms structure to be utilized. 


Three types of implementation are described in this chapter, and the 
performance of each depends on the parameters of the transform to be 
computed and the characteristics of the underlying machine. For a given 
machine and FFT to be computed (which has parameters such as length and 
precision), the fastest configuration is selected from among a small set of up to 
eight possible FFT configurations — a much smaller space compared to FFTW's 
exhaustive search of all possible FFTs. The fastest configuration is easily 
selected by timing each of the possible options, but it is shown in Results and 
discussion that it is also possible to use machine learning to build a classifier 
that will predict the fastest based on attributes such as the size of the cache. 


SFFT comprises three types of conjugate-pair implementation, which are: 


1. Fully hard-coded FFTs; 
2. Four-step FFTs with hard-coded sub-transforms; 
3. FFTs with hard-coded leaves. 


Fully hard-coded 


Statically elaborating a DAG that represents a depth-first recursive FFT is 
much like computing a depth-first recursive FFT: instead of performing 
computation at the leaves of the recursion and where smaller DFTs are 
combined into one, a node representing the computation is appended to the end 
of a list, and the list of nodes, i.e., a topological ordering of the DAG, is later 
translated into a program that can be compiled and executed. 


Emitting code with a vector length of 1 (i.e., scalar code or vector code where 
only one complex element fits in a vector register) is relatively simple and is 
described in '"Vector length 1". For vector lengths above 1, vectorizing the 
topological ordering of nodes poses some subtle challenges, and these details 
are described in "Other vector lengths". The fully hard-coded FFTs described in 
this section are generally only practical for smaller sizes of transforms, 
typically where NV < 128, however these techniques are expanded in later 
sections to scale the performance to larger sizes. 


Vector length 1 


A VL of 1 implies that the computation is essentially scalar, and only one 
complex element can fit in a vector register. An example of such a scenario is 
when using interleaved double-precision floating-point arithmetic on an SSE2 
machine: one 128-bit XMM register is used to store two 64-bit floats that 
represent the real and imaginary parts of a complex number. 


When V L = 1, the process of generating a program for a hard-coded FFT is as 
follows: 


1. Elaborate a topological ordering of nodes, where each node represents 
either a computation at the leaves of the transform, or a computation in the 
body of the transform (i.e., where smaller sub-transforms are combined 
into a larger transform); 

2. Write the program header to output, including a list of variables that 
correspond to registers used by the nodes; 

3. Traverse the list of nodes in order, and for each node, emit a statement that 
performs the computation represented by the given node. If a node is the 


last node to use a variable, a statement storing the variable to its 
corresponding location in memory is also emitted; 
4. Write the program footer to output. 


Elaborate 


[link] is a function, written in C++, that performs the first task in the process. 
As mentioned earlier, elaborating a topological ordering of nodes with a depth- 
first recursive structure is much like actually computing an FFT with a depth- 
first recursive program (cf. Listing 3 in Appendix 2). [link] lists the nodes 
contained in the list ‘ns’ after elaborating a size-8 transform by invoking 
elaborate(8, 0, 9, 0). 


CSplitRadix::elaborate(int N, int ioffset, int offs 
et, int stride) { 
if(N > 4) £ 
elaborate(N/2, ioffset, offset, stride+1); 
if(N/4 >= 4) £ 
elaborate(N/4, ioffset+(1<<stride), offset+ 
(N/2), stridet2); 
elaborate(N/4, ioffset-(1<<stride), offset+ 
(3*N/4), stride+2); 
telsef{ 
CNodeLoad *n = new CNodeLoad(this, 4, ioffset 
, stride, 0); 
ns.push_back(assign_leaf_registers(n)); 


J 
for(int k=0;k<N/4;k++) { 
CNodeBfly *n = new CNodeBfly(this, 4, k, stri 
de); 
ns.push_back(assign_body_registers(n,k,N); 


} 
selse if(N==4) { 
CNodeLoad *n = new CNodeLoad(this, 4, ioffset, 
stride, 1); 
ns.push_back(assign_leaf_registers(n)); 
selse if(N==2) { 
CNodeLoad *n = new CNodeLoad(this, 2, ioffset, 


stride, 1); 
ns.push_back(assign_leaf_registers(n)); 
b; 


Elaborate function for hard-coded conjugate-pair FFT 


A transform is divided into sub-transforms with recursive calls at lines 4, 6 and 
7, until the base cases of size 2 or size 4 are reached at the leaves of the 
elaboration. As well as the size-2 and size-4 base cases, which are handled at 
lines 20-21 and 17-18 (respectively), there is a special case where two size-2 
base cases are handled in parallel at lines 9-10. This special case of handling 
two size-2 base cases as a larger size-4 node ensures that larger transforms are 
composed of nodes that are homogeneous in size — this is of little utility when 
emitting VL = 1 code, but it is exploited in "Other vector lengths" where the 
topological ordering of nodes is vectorized. The second row of [Link] is just 
such a special case, since two size-2 leaf nodes are being computed, and thus 
the size is listed as 2(x2). 


The elaborate function modifies the class member variable “ns' at lines 10, 
14, 18 and 21, where it appends a new node to the back of the list. After the 
function returns, the ns list represents a topological ordering of the 
computation with CNodeLoad and CNodeBf Ly nodes. The nodes of type 
CNodeLoad represent leaf computations: these computations load elements 
from the input array and perform a small amount of leaf computation, leaving 
the result in a set of registers. The CNodeBf ly nodes represent computations 
in the body of the transform: these use a twiddle factor to perform a butterfly 
computation on a vector of registers, leaving the result in the same registers. 


Type Size Addresses Registers Twiddle 
CNodeLoad 4 {0,4,2,6} {0,1,2,3} 


CNodeLoad 2(x2) {1,5,7,3} {4,5,6,7} 


CNodeBfly A {0,2,4,6} w? 
CNodeBfly A {1,3,5,7} Ws 


VL-1 size-8 conjugate-pair transform nodes 


The constructor for a CNodeLoad object computes input array addresses for 
the load operations using the input array offset (Lof fset), the input array 
stride, the size of the node (the nodes instantiated at lines 9 and 17 are size- 
4, and the node instantiated at line 20 is size-2) and a final parameter that is 
non-zero if the node is a single node (the nodes instantiated at lines 17 and 20 
are single nodes, while the node instantiated at line 9 is composed of two size-2 
nodes). 


As the newly instantiated CNodeLoad objects are appended to the back of ns 
at lines 10, 14 and 21, the assign_leaf_registers function assigns 
registers to the outputs of each instance. Registers are identified with integers 
beginning at zero, and when each register is created it is assigned an identifier 
from an auto-incrementing counter (Reoynter). This function also maintains a 
map of registers to node pointers, referred to as rmap, where the node for a 
given register is the last node to reference that register. 


The constructor for a CNodeBf Ly object uses k and stride to compute a 
twiddle factor for the new instance of a butterfly computation node. When the 
new instance of CNodeBf Ly is appended to the end of ns at line 14, the 
assign_body_registers function assigns registers R; to a node of size 
Nnoae With the following logic: 

Equation: 


N 
Ri; = Reouter — N+ k+i x A 


where 7 = 0,---, Nnoae — 1 and Reounter is the auto-incrementing register 
counter. The asSign_body_registers functions also updates the map of 
registers to node pointers by setting rmap|R,] to point to the new instance of 
CNodeBf ly. 


Emitting code 


void sfft_dcf8_hc(sfft_plan_t *p, const void *vin, 
void *vout) { 
const SFFT_D *in = vin; 
SFFT_D *out = vout; 
SFFT_R r0,ri,r2,r3,r4,r5,r6,r7; 


L_4(in+0,in+8,in+4,1in+12,&r0,&r1,&r2,&r3); 
L_2(in+2,1in+10, 1n+14,in+6,&r4,&r5,&r6,&r7); 
K_O(&r0,&r2,&r4,&r6); 
S_4(r0,r2,r4,r6, out+O, out+4, out+8, out+12) ; 
K_N(VLIT2(0.7071,0.7071),VLIT2(0.7071, -0.7071),&r 
1, &r3,&r5,&r7); 
S_4(r1,r3,r5,r7, out+2, out+6, out+10, out+14); 


Hard-coded VL-1 size-8 FFT 


Given a list of nodes, it is a simple process to emit C code that can be compiled 
to actually compute the transform. 


The example in [link] would be emitted from the list of four nodes in [link]. 
Lines 1-4 are emitted from a function that generates a header, and line 13 is 
emitted from a function that generates a footer. Lines 6-11 are generated based 
on the list of nodes. 


[link] contains references to several types, functions and macros that use upper- 
case identifiers — these are primitive functions or types that have been 
predefined as inline functions or macros. A benefit of using primitives in this 
way is that the details specific to numerical representation and the underlying 
machine have been abstracted away; thus, the same function can be compiled 
for a variety of types and machines by simply including a different header file 
with different primitives. [link], for example, could be compiled for double- 
precision arithmetic on an SSE2 machine by including sse_double.h, or it 
could be compiled with much slower scalar arithmetic by including 

scalar .h. The same code can even be used, without modification, to 
compute forward and backwards transforms, by using C preprocessor directives 
to conditionally alter the macros. 


In order to accommodate mixed numerical representations, the signature of the 
outermost function references data with void pointers. In the case of the double- 
precision example in [link], SFFT_D would be defined to be doub1e in the 
appropriate header file, and the void pointers are then cast to SFFT_D pointers. 


The size-8 transform in [link] uses 8 registers, and thus a declaration of 8 
registers of type SFFT_R has been emitted at line 4 in [link]. In the case of 
double-precision arithmetic on a SSE2 machine, SFFT_R is defined as 
Eemi26d in ssexdouble=h. 


The first two rows of [link] correspond to lines 6 and 7 of [link], respectively. 
The L_4 primitive is used to compute the size-4 leaf node in the first row of the 
table. The second row is a load/leaf node of size 2(x2), indicating two size-2 
nodes in parallel, which is computed with the L_2 primitive. The input 
addresses in the table are the addresses of complex words, while the addresses 
in the generated code refer to the real and imaginary parts of a complex word, 
and thus the addresses from [link] are multiplied by a factor of 2 to obtain the 
addresses in [link]. 


The final two CNodeBf Ly nodes of [link] correspond to the K_O and K_N 
sub-transform (a.k.a. butterfly) primitives at lines 8 and 10, respectively. 
Because the node in the third row of [link] has a twiddle factor of we (i.e., 
unity), the computation requires no multiplication, and the K_O primitive is 
used for this special case. The K_N primitive at line 10 does require a twiddle 
factor, which is passed to K_N as two vector literals that represent the twiddle 
factor in unpacked form.[footnote] Fast interleaved complex multiplication 
describes how interleaved complex multiplication is faster if one operand is 
pre-unpacked. 

For the purposes of brevity, the precision has been truncated to only a few 
decimal places. 


After each node is processed, the registers that have been used by it are checked 
in a map (rmap) that maps each register to the last node to have used that 
register. If the current node is the last node to have used a register, the register 
is stored to memory. In the case of the transform in [link], four registers are 
stored with an instance of the S_4 primitive at lines 9 and 11. In contrast to the 
load operations at the leaves, which are decimated-in-time and thus effectively 
pseudo-random memory accesses, the store operations are to linear regions of 


memory, the addresses of which can be determined from each register's integer 
identifier. The store address offset for data in register R; is simply z x 2 x VL. 


Other vector lengths 


If VL > 1, the list of nodes that results from the eLaborate function in 
[link] is vectorized. Broadly speaking, CNodeLoad objects that operate on 
adjacent memory locations are collected together and computed in parallel. 
After each such computation, each position in a vector register contains an 
element that belongs to a different node. Transposes are then used to transform 
sets of vector registers such that each register contains elements from one node. 
Finally, the CNodeBf Ly objects can be easily computed in parallel, as they 
were with VL-1 because the elements in each vector register correspond to one 
node. 


Overview 


[link] lists the nodes that represent a VL-1 size-16 transform. A VL of 2 implies 
that each vector register contains 2 complex words, and load operations on each 
of the 4 addresses in the first row of [link] will also load the complex words in 
the adjacent memory locations. Note that the complex words that would be 
incidentally loaded in the upper half of the VL-2 registers are the complex 
words that the third CNodeLoad object at row 5 would have loaded. This is 
exploited to load and compute the first and third CNodeLoad objects in 
parallel. 


Type Size Addresses Registers Twiddle 
CNodeLoad 4 {0,8,4,12} {0,1,2,3} 


CNodeLoad 2(x2) {2,10,14,6} {4,5,6,7} 


CNodeBfly 4 {0,2,4,6} Wie 


CNodeBfly 4 {1,3,5,7} Wig 
CNodeLoad 4 {1,9,5,13} {8,9,10,11} 
CNodeLoad 4 {15,7,3,11} {12,13,14,15} 
CNodeBfly 4 {0,4,8,12} wis 
CNodeBfly 4 {1,5,9,13} Wig 
CNodeBfly 4 {2,6,10,14} Wig 
CNodeBfly 4 {3,7,11,15} wt 


VL-1 size-16 conjugate-pair transform nodes 


Type Sizes Addresses Registers Twiddles 
{{0,1},{8,9}, {{0,1},{2,3}, 
Load {4,4} {4,5}, {8,9}, 
{12,13}} {10,11}} 
oo {{4,5},{6,7}, 
Load {2(x2),4} ea {14,15}, 
{14,15}, {12,13}} 
{6,7} } , 


{{0,1},{2,3}, | (9, 


Bfly {4,4} {4,5},{6,7}} Wig } 


Bfly {4,4} {{0,1},{4,5}, 


(8,91. { w!,, 


{12,13}} wi, } 
{{2,3}{6,7} | 42 

Bfly {4,4} {10,11}, 3 7 
{14,15}} “16 


VL-2 size-16 conjugate-pair transform nodes 


The second CNodeLoad object computes two size-2 leaf transforms in 
parallel, while the last CNodeLoad object computes a size-4 leaf transform. 
Because the size-4 transform is composed of two size-2 transforms, and 
memory addresses of the fourth CNodeLoad are adjacent (although 
permuted), some of the computation can be computed in parallel. 


If the CNodeLoad objects at rows 1 and 5 are computed in parallel, the output 
will be four VL-2 registers: { {0,8}, {1,9}, {2,10}, {3,11}} —i.e., the first 
register contains what would have been register 0 in the lower half, and what 
would have been register 8 in the top half etc. Similarly, computing rows 2 and 
6 in parallel would yield four VL-2 registers: {{4,14}, {5,15}, {6,12}, {7,13}} 
— note the permutation of the upper halves in this case. These registers are 
transposed to {{0,1}, {2,3}, {8,9}, {10,11}} and {{4,5}, {6,7}, {14,15}, 
{12,13}}, as in row 1 and 2 of [link]. 


With the transposed VL-2 registers, it is now possible to compute CNodeBfly 
nodes in parallel. For example, rows 2 and 3 of [link] can be computed in 
parallel on four VL-2 registers represented by {{0,1}, {2,3}, {4,5}, {6,7}}, as 
in row 3 of [link]. 


Implementation 


[link] is a C++ implementation of the vectorize_loads function. This 
function modifies a topological ordering of nodes (the class member variable 
ns) and uses two other functions: Find_parallel_loads, which searches 
forward from the current node to find another CNodeLoad that shares adjacent 
memory addresses; and merge_loads(a, b), which adds the addresses, 
registers and type of b to a. Type introspection is used at lines 7 and 36 (and in 
other Listings), to differentiate between the two types of object. 


[link] is a C++ implementation of the vectorize_ks function. For each 
CNodeBf Ly node, the function searches forward for another CNodeBf Ly that 
does not have a register dependence. Once found, the registers of the latter node 
are added to the former node, and the latter node erased. Finally, at line 19, the 
registers of the vectorized CNodeBf ly node are merged using a perfect 
shuffle, which is then recursively applied on each half of the list. The effect is a 
merge that works for any power of 2 vector length. 


void CSplitRadix::vectorize_ks() { 
vector<CNodeHardCoded *>::iterator 1; 
for(i=ns.begin(); 1 != ns.end();++1) { 
if(!(*1)->type().compare(° “blockbfly'')) { 
vector<CNodeHardCoded *>::iterator j = it+1, p 
Jl, 
int count = 1; 
while(j != ns.end() && count < VL) { 
PC) = 
>type().compare( “blockbfly'') && !register_dependenc 
e(*i, *Jj)) { 
(*1)->rs.insert( (*1)->rs.end(), (*Jj)- 
>rs.begin(), (*j)->rs.end()); 
ns.erase(j); 
count++; 
j = pjtt,; 
telse { 
pJ) = J; ++; 
t 
J 


(*1)->merge_rs(); 
} 


} 
} 
Body node vectorization 
CNodeLoad * 
CSplitRadix: :find_parallel_load(vector<CNodeHardCod 
ed*>::iterator i){ 
CNodeLoad *b = (CNodeLoad *)(*1); 
for(int k=0;k<((N>2)?4:2);k++) { 


vector<CNodeHardCoded *>::iterator j = i+1; 
while(j != ns.end()) { 
if(!(*j)->type().compare(° “blockload'')) { 
CNodeLoad *b2 = (CNodeLoad *)(*j); 
if(b2->iaddrs[k] > b->iaddrs[0] && b2- 
>laddrs[k] < b->iaddrs[O]+VL) { 
ns.erase(j); 
return b2; 
a 
++); 
} 
i 


} 
return NULL; 


} 
void CSplitRadix: :merge_loads(CNodeLoad *b1, CNodeL 
oad *b2) { 
for(int 1=0;1i<bi->size;i++) { 
for(int j=0;j<b2->i1addrs.size();jt++) { 
if(b2->iaddrs[j] > b1->iaddrs[i] && b2- 
>iaddrs[j] < bi->iaddrs[i]+VL) { 
bi->1addrs.push_back(b2->1addrs[j]); 
b1i->rs.push_back(b2->rs[j]); 
if(rmap[b2->rs[j]] == b2) rmap[b2- 
>rs[j]] = b1; 
iz 


} 
} 
bi->types.push_back(b2->types[0]); 


void CSplitRadix::vectorize_loads() { 
vector<CNodeHardCoded *>::iterator 1; 
for(i=ns.begin(); 1 != ns.end();++1) { 
if(!(*1)->type().compare(° “blockload'')) { 
while(CNodeLoad *b2 = find_parallel_load(1i)) 
merge _loads((CNodeLoad *)(*1), b2); 


} 


Leaf node vectorization 


If vectorize_ loads and vectorize_ks are invoked with VL = 2 on 
the topological ordering of nodes in [link], the result is the vectorized node list 
shown in [link]. As in "Emitting code", emitting code is a fairly simple process, 
and [link] is the code emitted from the node list in [link]. There are only a few 
differences to note about the emitted code when VL > 1. 


1. The register identifiers in line 4 of [link] consist of a list of two integers 
delimited with an underscore. The integers listed in each register's name 
are the VL-1 registers that were subsumed to create the larger register (cf. 
VL-1 code in [link]); 

2. The leaf primitives (lines 6 and 7 in [link]) have a list of underscore 
delimited integers in the name, where each integer corresponds to the type 
of sub-transform to be computed on that position in the vector registers. 
For example, the L_4_4 primitive is named to indicate a size-4 leaf 
operation on the lower and upper halves of the vector registers, while the 
L_2_4 primitive performs two size-2 leaf operations on the lower half of 
the registers and a size-4 leaf operation on the upper halves; 

3. The body node primitives (K_N) and store primitives (S_4) are unchanged 
because they perform the same operation on each element of the vector 
registers. This is as a result of the register transposes that were previously 
performed on the outputs of the leaf primitives. 


void sfft_fcfi6_hc(sfft_plan_t *p, const void *vin, v 
oid *vout) { 

const SFFT_D *in = vin; 

SFFT_D *out = vout; 

SFFT_R rQ_1,r2_3,r4_5, r6_7,r8_9, ri0_11,r12_13,r14_1 


So; 

L_4 4(in+0, in+16, in+8, 1n+24, &r0_1,&r2_3,&r8_9, &r10_ 
11); 

L_2 4(in+4, in+20, in+28, in+12, &r4_5, &r6_7,&r14_15, &r 
12-13); 


K_N(VLIT4(0.7071,0.7071,1,1), 
VLIT4(0.7071, -0.7071,0, -0), 
&rO_1,&r2_3,&r4_5, &r6_7); 


K_N(VLIT4(0.9239,0.9239,1,1), 
VLIT4(0.3827, -0.3827,0,-0), 
&rO_1,&r4_5,&r8_9,&r12_13); 
S_4(r0_1, r4_5, r8_9, r12_13, out+0, out+8, out+16, out+24 
); 
K_N(VLIT4(0.3827,0.3827,0.7071,0.7071), 
VLIT4(0.9239, -0.9239,0.7071, -0.7071), 
&r2_3,&r6_7,&r10_11, &r14_15); 
S_4(r2_3, r6_7, r10_11, r14_15, out+4, out+12, out+20, out 
+28); 


} 
Hard-coded VL-2 size-16 FFT 


Scalability 


So far, hard-coded transforms of vector length 1 and 2 have been presented. On 
Intel machines, VL-1 can be used to compute double-precision transforms with 
SSE2, while VL-2 can be used to compute double-precision transforms with 
AVX and single-precision transforms with SSE. The method of vectorization 
presented in this chapter scales above VL-2, and has been successfully used to 
compute VL-4 single-precision transforms with AVX. 


The leaf primitives were coded by hand in all cases; VL-1 required L_2 and 
L_4, while VL-2 required L_2_2, L_2_4,L_4 2 and L_4_4. In the case of 
VL-4, not all permutations of possible leaf primitive were required — only 11 
out of 16 were needed for the transforms that were generated. 


It is an easy exercise to code the leaf primitives for VL < 4 by hand, but for 
future machines that might feature vector lengths larger than 4, the leaf 
primitives could be automatically generated (in fact, "Other vector lengths" is 
concerned with automatic generation of leaf sub-transforms at another level of 
scale). 


Constraints 


For a transform of size N and leaf node size of S (S = 4 in the examples in 
this chapter), the following constraint must be satisfied: 


Equation: 


N/VL>S 


If this constraint is not satisfied, the size of either VL or S must be reduced. In 
practice, VL and S are small relative to the size of most transforms, and thus 
these corner cases typically only occur for very small sized transforms. Such an 
example is a size-2 transform when VL = 2 and S = 4, where in this case the 
transform is too small to be computed with SIMD operations and should be 
computed with scalar arithmetic instead. 


Performance 
Single- Double- Single- Double- 
precision, precision, SSE precision, precision, AVX 
SSE (VL-2) (VL-1) AVX (VL-4) (VL-2) 


Performance of hard-coded FFTs on a Macbook Air 4,2. 


[link] shows the results of a benchmark for transforms of size 4 through to 1024 
running on a Macbook Air 4,2. The speed of FFTW 3.3 running in estimate and 
patient modes is also shown for comparison. 


FFTW running in patient mode evaluates a huge configuration space of 
parameters, while the hard-coded FFT required no calibration. 


A variety of vector lengths are represented, and the hard-coded FFTs have good 
performance while N/V LZ < 128. After this point, performance drops off and 
other techniques should be used. The following sections use the hard-coded 
FFT as a foundation for scaling to larger sizes of transforms. 


Hard-coded four-step 


This section presents an implementation of the four-step algorithm [link] that 
leverages hard-coded sub-transforms to compute larger transforms. The 
implementation uses an implicit memory transpose (along with vector register 
transposes) and scales particularly well with VL. In contrast to the fully hard- 
coded implementation in the previous section, the four-step implementation 
requires no new leaf primitives as VL increases, i.e., the code is much the same 
when VEZ > lasitis when VZ = 1. 


The four-step algorithm 


A transform of size N is decomposed into a two-dimensional array of size 

m1 X Ng where N = n1nzo. Selecting ny = ng = VN (or close) often obtains 
the best performance results [link]. When either of the factors is larger than the 
other, it is the larger of the two factors that will determine performance, 
because the larger factor effectively brings the memory wall closer. The four 
steps of the algorithm are: 


1. Compute n; FFTs of length n2 along the columns of the array; 

2. Multiply each element of the array with W4 where 2 and 7 are the array 
coordinates; 

3. Transpose the array; 

4. Compute nz FFTs of length n,; along the columns of the array. 


For this out-of-place implementation, steps 2 and 3 are performed as part of 
step 1. Step 1 reads data from the input array and computes the FFTs, but 
before storing the data in the final pass, it is multiplied by the twiddle factors 
from step 2. After this, the data is stored to rows in the output array, and thus 
the transpose of step 3 is performed implicitly. Step 4 is then computed as 
usual: FFTs are computed along the columns of the output array. 


This method of computing the four-step algorithm in two steps requires only 
minor modifications in order to support multiple vector lengths: with VZ > 1, 
multiple columns are read and computed in parallel without modification of the 
code, but before storing multiple columns of data to rows, a register transpose 
is required. 


Vector length 1 
When V L = 1, three hard-coded FFTs are elaborated. 


1. FFT of length nz with stride n, x 2 for the first column of step 1; 

2. FFT of length n2 with stride n; x 2 and twiddle multiplications on 
outputs — for all other columns of step 1; 

3. FFT of length n,; with stride nz x 2 for columns in step 4. 


In order to generate the code for the four-step sub-transforms, some minor 
modifications are made to the fully hard-coded code generator that was 
presented in the previous section. 


The first FFT is used to handle the first column of step 1, where there are no 
twiddle factor multiplications because one of the array coordinates for step 2 is 
zero, and thus Be is unity. This FFT may be elaborated as in "Vector length 1" 
with the addition of a stride factor for the input address calculation. The second 
FFT is elaborated as per the first FFT, but with the addition of twiddle factor 
multiplications on each register prior to the store operations. The third FFT is 
elaborated as per the first FFT, but with strided input and output addresses. 


const SFFT_D __attribute__ ((aligned(32))) *LUT; 
const SFFT_D *pLUT; 
void sfft_dcf64_fs_x1_O0(sfft_plan_t *p, const void *v 
in, void *vout){ 
const SFFT_D *in = vin; 
SFFT_D *out = vout; 
SFFT_R r0O,ri,r2,r3,r4,r5,r6,r7; 
L_4(in+0, in+64, 1n+32,1n+96,&r0,&r1,&r2,&r3); 
L_2(in+16, in+80, i1n+112,1n+48,&r4,&r5,&r6,&r7); 
K_O(&r0,&r2,&r4,&r6); 
S_4(r0,r2,r4,r6,out+0, out+4, out+8, out+12); 


K_N(VLIT2(0.7071, 0.7071), VLIT2(0.7071, -©.7071),&r1, 


&r3,&r5,&r7); 


S_4(r1,r3,r5,r7,out+2, out+6, out+10, out+14); 


void sfft_dcf64_fs_xi_n(sfft_plan_t *p, const void *v 
in, void *vout){ 


const SFFT_D *in = vin; 

SFFT_D *out = vout; 

SFFT_R r0,ri,r2,r3,r4,r5,r6,r7; 

L_4(in+0, 1n+64, in+32,in+96,&r0,&r1,&r2,&r3); 
L_2(in+16, in+80, in+112,1n+48, &r4,&r5,&r6, &r7); 
K_O(&r0,&r2,&r4,&r6); 

r2 = MUL(r2,LOAD(pLUT+4), LOAD(pLUT+6) ); 

r4 = MUL(r4, LOAD(pLUT+12), LOAD(pLUT+14) ); 

r6 = MUL(r6, LOAD(pLUT+20), LOAD(pLUT+22) ); 
S_4(r0,r2,r4,r6,out+0, out+4, out+8, out+12); 
K_N(VLIT2(0.7071, 0.7071), VLIT2(0.7071, -0.7071),&r1, 


&r3,&r5,&r7); 
ri = MUL(ri, LOAD(pLUT+0), LOAD(pLUT+2) ); 
r3 = MUL(r3, LOAD(pLUT+8), LOAD(pLUT+10) ); 
r5 = MUL(r5,LOAD(pLUT+16), LOAD( pLUT+18) ); 
r7 = MUL(r7,LOAD(pLUT+24), LOAD( pLUT+26) ); 


S_4(r1,r3,r5,r7,out+2, out+6, out+10, out+14); 
pLUT += 28; 


void sfft_dcf64_fs_x2(sfft_plan_t *p, const void *vin 


I 


void *vout){ 

const SFFT_D *in = vin; 

SFFT_D *out = vout; 

SFFT_R r0,ri,r2,r3,r4,r5,r6,r7; 

L_4(in+0, 1n+64, in+32,1in+96,&r0,&r1,&r2,&r3); 
L_2(in+16, in+80, in+112,1n+48, &r4,&r5,&r6, &r7); 
K_O(&r0,&r2,&r4,&r6); 
S_4(r0,r2,r4,r6, out+0, out+32, out+64, out+96) ; 
K_N(VLIT2(0.7071, 0.7071), VLIT2(0.7071, -0.7071),&r1, 


&r3,&r5,&r7); 


S_4(r1,r3,1r5,r7,out+16, out+48, out+80, out+112) ; 


} 
void sfft_dcf64_fs(sfft_plan_t *p, const void *vin, v 


oid *vout) { 

const SFFT_D *in = vin; 

SFFT_D *out = vout; 

pLUT = LUT; 

int 1; 

sfft_dcf64_fs_x1_0(p, in, out); 

for(1=1;1<8;1i++) sfft_dcf64_fs_x1_n(p, int 
(1*2), out+(i*16)); 

for(1=0;1<8;it+) sfft_dcf64_fs_x2(p, out+ 
(1*2), out+(1i*2)); 


Hard-coded four-step VL-1 size-64 FFT 


Example 


[link] is a VL-1 size-64 hard-coded four-step FFT. Before it can be used, an 
initialization procedure (not shown) allocates and populates the LUT at line 1 
with the twiddle factors that are required for the step 2 multiplications. Line 44 
shows the main function that executes the first sub-transform on the first 
column (line 49), and the second sub-transform on all remaining columns (line 
50). Finally, the sub-transforms corresponding to step 4 of the four-step 
algorithm are executed on all columns in line 51. 


The twiddle factor multiplication that corresponds to step 2 of the four-step 
algorithm takes place in lines 21-23 and lines 26-29. The first register is not 


multiplied with a twiddle factor because the first row of twiddle factors are wh 


(i.e., unity). The other registers are multiplied with two registers loaded from 
the LUT, which are the unpacked real and imaginary parts (see Fast interleaved 
complex multiplication for details about unpacked complex multiplication). 


Other vector lengths 


const SFFT_D __attribute_ ((aligned(32))) *LUT; 
const SFFT_D *pLUT; 

void sfft_fcf64_fs_xi(sfft_plan_t *p, const void *vin 
, void *vout) { 


const SFFT_D *in = vin; 

SFFT_D *out = vout; 

SFFT_R r0,ri,r2,r3,r4,r5,r6,r7; 

L_4(in+0, 1n+64, in+32,1in+96,&r0,&r1,&r2,&r3); 

L_2(1in+16, in+80, in+112,1n+48,&r4,&r5,&r6, &r7); 

K_O0(&r0,&r2,&r4,&r6); 

K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), 
VLIT4(0.7071, -@.7071, 0.7071, -0.7071),&r1, &r3, &r 


5,&r7); 


ri = MUL(ri, LOAD(pLUT+0),LOAD(pLUT+4) ); 
TX2(r0,r1); 

r2 = MUL(r2,LOAD(pLUT+8), LOAD(pLUT+12) ); 
r3 = MUL(r3,LOAD(pLUT+16), LOAD(pLUT+20) ); 
TX2(r2,13); 

r4 = MUL(r4, LOAD(pLUT+24), LOAD(pLUT+28) ); 
r5 = MUL(r5, LOAD(pLUT+32), LOAD(pLUT+36) ); 
TX2(r4,r5); 

r6 = MUL(r6,LOAD(pLUT+40), LOAD(pLUT+44) ); 
r7 = MUL(r7,LOAD(pLUT+48),LOAD(pLUT+52) ); 
TX2(r6,r7); 
S_4(r0,r2,r4,r6,out+0, out+4, out+8, out+12); 
S_4(r1,r3,r5,r7,out+16, out+20, out+24, out+28); 
pLUT += 56; 


void sfft_fcf64_fs_x2(sfft_plan_t *p, const void *vin 


/ 


void *vout) { 

const SFFT_D *in = vin; 

SFFT_D *out = vout; 

SFFT_R r0O,ri,r2,r3,r4,r5,r6,r7; 

L_4(in+0, i1n+64, in+32,in+96,&r0,&r1,&r2,&r3); 

L_2(in+16, in+80, in+112,1n+48,&r4,&r5,&r6, &r7); 

K_O0(&r0,&r2,&r4,&r6); 

K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), 
VLIT4(0.7071, -0.7071,0.7071, -0.7071),&r1, &r3,&r 


5,&r7); 


S_4(r0,r2,r4,r6, out+0, out+32, out+64, out+96) ; 
S_4(r1,r3,r5,r7,out+16, out+48, out+80, out+112) ; 


} 
void sfft_fcf64_fs(sfft_plan_t *p, const void *vin, v 


oid *vout) { 

const SFFT_D *in = vin; 

SFFT_D *out = vout; 

pLUT = LUT; 

int 1; 

for(1=0;1<4;i++) sfft_fcf64_fs_xi(p, in+(1i*4), out+ 
(1*32)); 

for(1=0;1<4;1++) sfft_fcf64_fs_x2(p, out+ 
(1*4), out+(1i*4)); 


Hard-coded four-step VL-2 size-64 FFT 


For VL > 1, the FFTs along the columns are computed in parallel. Thus, in 
step 1, n;/V L FFTs are computed along the columns of the array with stride 
= 2 x VE, and in step 4, n2/V L FFTs are computed along the columns with 
stride = 2 x VL. 


An implication of computing the first column in parallel with other columns is 
that the first column is now multiplied by unity twiddle factors, and thus only 
two sub-transforms are used instead of three. 


The only other difference when VL > 1 is that the registers need to be 
transposed before storing columns to rows (the implicit transpose that 
corresponds to step 3). To accomplish this when generating code,n = VL 
store operations are latched before the transpose and store code is emitted. 


Example 


[link] implements a VL-2 size-64 hard-coded four-step FFT. The main function 
(line 39) computes 8 FFTs along the columns for step 1 at line 44, and 8 FFTs 
along the columns for step 4 at line 45. There are only 4 iterations of the loop in 
each case because two sub-transforms are computed in parallel with each 
invocation of the sub-transform function. 


In the function corresponding to the sub-transforms of step 1 (line 3), two store 
operations are latched (lines 23 and 24) before emitting code, which includes 
the preceding transposes (the TX2 operations) and twiddle factor 
multiplications (lines 13-22). 


Performance 
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Performance of hard-coded four-step FFTs on a Macbook Air 4,2. 


[link] shows the results of a benchmark for transforms of size 16 through to 
8192 running on a Macbook Air 4,2. The speed of FFTW 3.3 running in 
estimate and patient modes is also shown for contrast. 


The results show that the performance of the four-step algorithm improves as 
the length of the vector increases, but, as was the case with the hard-coded 
FFTs in "Fully hard-coded", the performance of the hard-coded four-step FFTs 
is limited to a certain range of transform size. 


Hard-coded leaves 


The performance of the fully hard-coded transforms presented in "Fully hard- 
coded" only scales while N/V LZ < 128. This section presents techniques that 
are similar to those found in the fully hard-coded transforms, but applied at 
another level of scale in order to scale performance to larger sizes. 


Vector length 1 


The fully hard-coded transforms in ‘Fully hard-coded" used two primitives at 
the leaves: a size-4 sub-transform (L_4) and a double size-2 sub-transform 
(L_2). These sub-transforms loaded four elements of data from the input array, 
performed a small amount of computation, and stored the four results to the 
output array. 


Performance is scaled to larger transforms by using larger sub-transforms at the 
leaves of the computation. These are automatically generated using fully hard- 
coded transforms, and thus the size of the leaf computations is easily 
parametrized, which is just as well, because the optimal leaf size is dependent 
on the size of the transform, the compiler, and the target machine. 


The process of elaborating a topological ordering of nodes representing a hard- 
coded leaf transform of size V with leaf sub-transforms of size Nieaf is as 
follows: 


1. Elaborate a size Nieqgf sub-transform; 

2. Elaborate a two size Nica /2 sub-transforms as one sub-transform; 

3. Elaborate the main transform using the sub-transforms from steps 1 and 2 
as the leaves of the computation. 


The node lists for steps 1 and 2 are elaborated using the fully hard-coded 
elaborate function from [link], but because the leaf sub-transform in step 2 
is actually two sub-transforms of size Nieaf/2, the elaborate function is 
invoked twice with different offset parameters: 


1. elaborate(Nijeq 7/2, 0, 0, 1) ; 
2. elaborate(Nica/2, —1, Nicaz/2, 1) ; 


The code corresponding to steps 1 and 2 is emitted slightly differently than was 
the case with the fully hard-coded transforms. Instead of hard coding the input 
array indices, the indices are themselves loaded from an array that is 
precomputed when the transform is initialized. 


The node list corresponding to the main transform in step 3 is elaborated as in 
the function in [link], but with some minor change. First, the recursion 
terminates with leaf nodes of size Nica. Second, because the loops in the body 
of the sub-transform will be at least 2 x Niea# iterations, the loop for the body 


sub-transforms (line 12 of [link]) is not statically unrolled. Instead only one 
node is added to the list of nodes, and the loop is computed dynamically. 


void sfft_dcf64_hcl16_4 e(offset_t *is,const SFFT_D * 
in, SFFT_D *out){ 
SFFT_R r0,ri,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r1 
3,714,715; 
L_4(int+is[0],1in+is[1],int+is[2],int+tis[3],&r0,&r1,&r2 
7&3); 
L_2(int+is[4],1in+is[5],int+is[6],int+is[7],&r4,&r5, &r6 
,&r7); 
K_O0(&r0,&r2,&r4,&r6); 
K_N(VLIT2(0.7071, 0.7071),VLIT2(0.7071, -0.7071),&r1, 
&r3,&r5,&r7); 
L_4(in+is[8],int+is[9],in+is[10],in+is[11],&r8,&r9,& 
r10,&r11); 
L_4(in+is[12],in+is[13],in+is[14],in+is[15],&r12,&r 
13,&r14,&r15); 
K_O0(&r0,&r4,&r8,&r1i2); 
S_4(r0,r4,r8,r1i2, out+0, out+8, out+16, out+24); 
K_N(VLIT2(0.9239,0.9239),VLIT2(0.3827, -0.3827),&r1, 
&r5,&r9,&r13); 
S_4(r1,r5,r9,r13, out+2, out+10, out+18, out+26); 
K_N(VLIT2(0.7071,0.7071),VLIT2(0.7071, -0.7071),&r2, 
&r6,&r10,&r14); 
S_4(r2,r6,r10, r14, out+4, out+12, out+20, out+28) ; 
K_N(VLIT2(0.3827,0.3827),VLIT2(0.9239, -0.9239),&r3, 
&r7,&r11,&r15); 
S_4(r3,r7,r11, 715, out+6, out+14, out+22, out+30); 


i 
void sfft_dcf64_hcl16_4 o(offset_t *is,const SFFT_D * 
in, SFFT_D *out){ 
SFFT_R r0O,ri,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r1 
3,714,115; 
L_4(in+is[0],int+is[1],in+is[2],in+is[3],&r0,&r1, &r2 
7&3); 
L_2(int+is[4],1in+is[5],int+is[6],int+is[7],&r4,&r5, &r6 
,&r7); 
K_O(&r0,&r2,&r4,&r6); 


S_4(r0,r2,r4,r6,out+0, out+4, out+8, out+12); 

K_N(VLIT2(0.7071, 0.7071), VLIT2(0.7071, -0.7071),&r1, 
&r3,&r5,&r7); 

S_4(r1,1r3,r5,r7,out+2, out+6, out+10, out+14); 

L_4(int+is[8],in+is[9],int+is[10],int+is[11],&r8,&r9,& 
r10,&r1i1); 

L_2(in+is[12],in+is[13],in+is[14],in+is[15],&r12,&r 
13,&r14,&r15); 

K_O0(&r8,&r10,&r12,&r14); 

S_4(r8,r10, r12,r14, out+16, out+20, out+24, out+28) ; 

K_N(VLIT2(0.7071, 0.7071), VLIT2(0.7071, -0.7071),&r9, 
&ri1,&r13,&r15); 

S_4(r9,r11, 713, r15, out+18, out+22, out+26, out+30) ; 


void sfft_dcf64_hcl116_4 X_4(SFFT_D *data, int N, SFFT 
_D *LUT){ 
X_4(data, N, LUT); 


void sfft_dcf64_hcl116_4(sfft_plan_t *p, const void *v 
in, void *vout){ 
const SFFT_D *in = vin; 
SFFT_D *out = vout; 
p->is = p->is_base; 
sfft_dcf64_hcl16_4 e(p->1is,in,out+0); 
p->1s += 16; sfft_dcf64_hcl116_4 o(p- 
>1s,in, out+32); 
sfft_dcf64_hc116_4 X_4(out+0, 32, p->ws[0]); 
p->1s += 16; sfft_dcf64_hcl1i6_4 e(p- 
>1s,in, out+64); 
p->1s += 16; sfft_dcf64_hcli6_4 e(p- 
>1s,in,out+96); 
sfft_dcf64_hc116_4 X_4(out+0, 64, p->ws[1]); 
} 
Hard-coded VL-1 size-64 FFT with size-16 leaves 


Example 


[link] is a size-64 hard-coded leaf transform with size-16 leaves. The first 
function (lines 1-17) is a size-16 leaf sub-transform, while the second (lines 
18-32) consists of two size-8 leaf sub-transforms in parallel. The main function 
(lines 36-46) invokes four leaf sub-transforms (lines 40, 41, 43 and 44), and 
two loops of body sub-transforms (lines 42 and 45). 


The first parameter to the leaf functions (see lines 1 and 18) is a pointer into an 
array of precomputed indices for the input data array. At lines 41 and 43—44, 
the array is incremented before subsequent calls to the leaf functions, and at 
line 39 the pointer is reset to the base of the array so that the transform can be 
used repeatedly. 


The function used for the body sub-transforms (lines 33-35) is a wrapper for a 
primitive that computes a radix-2/4 butterfly. The last parameter to this function 
is a pointer to a precomputed LUT of twiddle factors for a sub-transform of size 
N (the second parameter). 


Improving memory locality in the leaves 


Size Input array addresses 
16 {0, 64, 32, 96, 16, 80, 112, 48, 8, 72, 40, 104, 120, 56, 24, 88} 
{4, 68, 36, 100, 20, 84, 116, 52, 124, 60, 28, 92, 12, 76, 108, 
8(x2) 
44} 
{2, 66, 34, 98, 18, 82, 114, 50, 10, 74, 42, 106, 122, 58, 26, 
16 
90} 
16 {126, 62, 30, 94, 14, 78, 110, 46, 6, 70, 38, 102, 118, 54, 22, 


86} 


16 {1, 65, 33, 97, 17, 81, 113, 49, 9, 73, 41, 105, 121, 57, 25, 89} 


8(x2) {5, 69, 37, 101, 21, 85, 117, 53, 125, 61, 29, 93, 13, 77, 109, 


45} 

{127, 63, 31, 95, 15, 79, 111, 47, 7, 71, 39, 103, 119, 55, 23, 
16 

87} 
8(x2) {3, 67, 35, 99, 19, 83, 115, 51, 123, 59, 27, 91, 11, 75, 107, 


43} 
Size-16 leaf nodes in VL-1 size-128 hard-coded leaf FFT 


[link] lists the addresses of data loaded by each of the size-16 leaf nodes in a 
size-128 transform. It is difficult to improve the locality of accesses within a 
leaf sub-transform (doing so would require the use of expensive transposes), 
but the order of the leaf sub-transforms can be changed to yield better locality 
between sub-transforms. 


Size Input array addresses 
16 {0, 64, 32, 96, 16, 80, 112, 48, 8, 72, 40, 104, 120, 56, 24, 88} 
16 {1, 65, 33, 97, 17, 81, 113, 49, 9, 73, 41, 105, 121, 57, 25, 89} 
{2, 66, 34, 98, 18, 82, 114, 50, 10, 74, 42, 106, 122, 58, 26, 
16 
90} 
(3; 67, 35, 99,19, 83, 115,51; 123, 59,27; 91, 11, 75; 107, 
8(x2) 43} 
8(x2) {4, 68, 36, 100, 20, 84, 116, 52, 124, 60, 28, 92, 12, 76, 108, 


44} 


8(x2) {5, 69, 37, 101, 21, 85, 117, 53, 125, 61, 29, 93, 13, 77, 109, 


45} 


16 {126, 62, 30, 94, 14, 78, 110, 46, 6, 70, 38, 102, 118, 54, 22, 
86} 

{127, 63, 31, 95, 15, 79, 111, 47, 7, 71, 39, 103, 119, 55, 23, 
16 87} 


Sorted size-16 leaf nodes in VL-1 size-128 hard-coded leaf FFT 


[link] is the list of nodes from [link] after the rows have been sorted according 
to the minimum address in each row. There are now three distinct groups in the 
list: the first three sub-transforms of size-16, the second three sub-transforms of 
2x size-8, and the final two sub-transforms of size-16. The memory accesses 
are now linear between consecutive sub-transforms, though the second and 
third groups operate on a permuted ordering of the addresses. 


The pattern exhibited by [link] can be exploited to access the data stored in the 
input array with better locality, as Figures [link] and [link] show. [link] depicts 
the memory access pattern of an FFT with size-16 hard-coded leaves, while 
[link] depicts the same FFT with sorted hard-coded leaves. 


To compute the FFT with sorted leaves, the leaf sub-transforms and the body 
sub-transforms are split into two separate lists, and the entire list of leaf sub- 
transforms is computed before any of the body sub-transforms. There is, 
however, a cost associated with this re-arrangement: each leaf sub-transform's 
offset into the output array is not easy to compute because the offsets are now 
essentially decimated-in-frequency, and thus they are now pre-computed. 
Overall, the trade-off is justified because the output memory accesses within 
each leaf sub-transform are still linear. 
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Memory access pattern of the straight line blocks of 
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Memory access pattern of the straight line blocks of 
code in a VL-1 size-128 hard-coded leaf FFT after 
leaf node sorting 


The leaf transforms can be computed in three loops. The first and third loops 
compute size-Njeq¢ sub-transforms, while the second loop computes size- 
Nica /2 sub-transforms. The size of the three loops io, 71 and @z are: 


Equation: 
N 
10 tes | a 1 
Equation: 
N N 1 
1) = a mod 3} x — 
3 xX Nieaf Nieaf 2 
and 
Equation: 


N 
122 = | ——————_ 
3 xX Nieaf 


The transform can now be elaborated without leaf nodes, and the code for the 
three loops emitted in the place of calls to individual leaf sub-transforms. 


Example 


[link] is the main function for the FFT that corresponds to the leaf node list in 
[link]. The first and third loops invoke size-16 sub-transforms at lines 8 and 16, 
and the second loop invokes 2x size-8 sub-transforms at line 12. Following the 
leaf sub-transforms, the body sub-transforms are called at lines 19-23. 


void sfft_dcfi28_ shli6_4(sfft_plan_t *p,const void *v 
in, void *vout){ 

const SFFT_D *in = vin; 

SFFT_D *out = vout; 

offset_t *1S = p->1is_base; 

offset_t *offsets = p->offsets_base; 

int 1; 


for(1=3;1>0;--1) { 
sfft_dcf128_shli6_4 e(is, in, outt+offsets[0]); 
is += 16; offsets += 1; 

} 

for(1=3;1>0;--1) { 
sfft_dcfi28_shli6_4 o(is, in, outt+offsets[0]); 
is += 16; offsets += 1; 

bs 

for(1=2;1>0;--1) { 
sfft_dcf128_shli6_4 e(is, in, outt+offsets[0]); 
is += 16; offsets += 1; 


i; 
sfft_dcf128_shli6_4 X_4(out+0, 32, p->ws[0]); 
sfft_dcf128_shli6_4 X_4(out+0, 64, p->ws[1]); 
sfft_dcf128_shli6_4 X_4(out+128, 32, p->ws[0]); 
sfft_dcf128_shli6_4 X_4(out+192, 32, p->ws[0]); 
sfft_dcf128_shl16_4 X_4(out+0, 128, p->ws[2]); 
} 
Hard-coded VL-1 size-128 FFT with size-16 leaves 
(sub-transforms omitted) 


Scalability 


In terms of code size, computing the leaf sub-transforms with three loops is 
economical. As the size of the transform grows, the code size attributed to the 
leaf sub-transforms remains constant. However, as the size of the transform 
begins to grow large (e.g., > 65, 536), the instructions required for the body 
sub-transform calls (lines 19-23 in [link]) begins to dominate the overall 
program size. "Optimizing the hierarchical structure" describes a method for 
compressing the code size of the body sub-transform calls while maintaining 
performance. 


Because the input array references between consecutive leaves are now linear, 
and like types of leaf sub-transforms are grouped together, it is now possible to 
compute several leaf sub-transforms in parallel, which is fully described in 
"Other vector lengths". 


Body sub-transform radix 


The radix of the body sub-transforms can be increased in order to reduce the 
number of passes over the data and make better use of the cache. In practice, 
the body sub-transform radix is limited by the associativity of the cache as the 
size of the transform increases. If the radix is greater than the associativity of 
the nearest level of cache in which a sub-transform cannot fit, there will be 
cache misses for every iteration of the sub-transform's loop, resulting in 
severely degraded performance. 


All Intel SIMD microprocessors since the Netburst micro-architecture have had 
at least 8-way associativity in all levels of cache, and thus increasing the radix 
from 4 to 8 is a sensible decision when targeting Intel machines. 


Just as the split-radix 2/4 algorithm requires two different types of leaf sub- 
transforms, a split-radix 2/8 algorithm would require three, which increases the 
complexity of statically elaborating and generating code. There is an alternative 
that does not require implementing three types of leaf sub-transform: where a 
size-N body sub-transform divides into a size N/2 body sub-transform and 
two size N/4 sub-transforms, the size N and size N/2 sub-transforms may be 
collected together and computed as a size-8 sub-transform. Thus the transform 
is computed with two types of leaf sub-transform and two types of body sub- 
transform, instead of three types of leaf sub-transform and one type of body 
sub-transform, as with the standard split-radix 2/8 algorithm. 


For the size-128 tranform in [link], either the sub-transform at line 19 can be 
subsumed into the sub-transform at line 20, or the sub-transform at line 20 can 
be subsumed into the sub-transform at line 23 — but not both. The latter choice 
is better because it involves larger transforms. 


CBody *CHardCodedLeaf: :find_subsumable_sub_transform( 
vector<CNode *>::reverse_iterator 1) { 
CBody *first = (CBody *)(*1); itt; 
while(i != bs.rend()) { 
if(!((*i)->type().compare("body"))) { 
CBody *second = (CBody *)(*1); 
if(first->N == second->N*2 && first- 
>offset == second->offset) { 
bs.erase((++1).base()); 


return second; 


} 
} 


++1% 


i 
return NULL; 
i 
void CHardCodedLeaf::increase_body_radix(void) { 
vector<CNode *>::reverse_iterator ri; 
for(ri=bs.rbegin(); ri!=bs.rend(); ++ri) { 
if(!((*ri)->type().compare("body"))) { 
CBody *n1 = (CBody *)(*ri); 
CBody *n2 = find_subsumable_sub_transform(ri1); 
if(n2) ni->size *= 2; 


} 
Doubling the radix of body sub-transforms 


The code in [link] iterates in reverse over a list of sub-transforms and doubles 
the radix of the body sub-transforms. Because the list may include multiple 
types, type introspection at lines 6 and 20 filters out all types that are not body 
sub-transforms. For each body sub-transform, the increase_body_radix 
function searches upwards through the list for a subsumable body sub- 
transform (using find _subsumable_sub_transform) and if a match is 
found, the smaller sub-transform is removed from the list, and the size of the 
larger sub-transform is doubled. 


®@ output @ input 
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FFT with sorted radix-2/4 and 
size-8 body sub-transforms 


[link] depicts the memory access patterns of a size-128 transform where the 
outermost body sub-transform has subsumed a smaller sub-transform to 
become a size-8 sub-transform. The columns from 33 onwards show the sub- 
transform accessing eight elements in the output data array (cf. [link], which 
shows the memory access patterns of the same transform prior to doubling the 
radix of the outer sub-transform). 


Optimizing the hierarchical structure 


The largest transform that has been considered so far is size-128. As it stands, 
the hard-coded leaf approach begins to generate code of unwieldy proportions 
as the size of the transform tends towards tens of thousands or hundreds of 
thousands of points. This is due to the lists of statically elaborated body sub- 
transform calls, e.g., a size-262,144 transform contains a lengthy list of 7279 
such calls. 


While long lists of statically elaborated calls are one extreme, the other is to 
compute the body sub-transforms with a recursive program. The former option 
degrades performance for larger transforms, while the latter option curbs 
performance for smaller transforms. A compromise is to somehow compress 
blocks of statically elaborated sub-transform calls. 


The approach presented here extracts the hierarchical structure from the 
sequence of body sub-transforms and emits a set of functions that are neither 
too small (as in the case of a recursive program) nor too large (as is the case 
with full static elaboration). This is accomplished by adapting the Sequitur 
algorithm [link], which builds a grammar of rules from a sequence of symbols, 
and enforces two basic constraints: 


1. no pair of adjacent symbols (referred to as a digram) appears more than 
once in the grammar; 
2. every rule is used more than once. 


The resulting grammar is an efficient hierarchical representation of the original 
sequence. Additional constraints can be imposed to limit the maximum or 
minimum size of each rule, which enable the size of the resulting functions to 
be tuned to be not too small and not too large. 


To build the grammar, each body sub-transform is represented by a symbol 
consisting of the size and offset of the sub-transform. The radix is discarded, 
because it can be inferred from the size. Here are several other details relevant 
to this particular application of Sequitur: 


e A digram of two sub-transforms is deemed to match another digram when 
the size of each sub-transform matches the size of the other digram's 
respective sub-transform and the relative offsets between sub-transforms 
within each digram match; 

e Sub-transform offsets are maintained to be always relative to the base of 
the containing rule — when a rule is constructed, the offsets of the symbols 
within that rule are adjusted to be relative to the base of the new rule, and 
when a rule is subsumed (due to violation of constraint 2: every rule must 
be used more than once), the offsets are recomputed to be relative to the 
subsuming rule. 


void sfft_dcf8192_shl16_8 4(sfft_plan_t *p, SFFT_D *o 
ut) { 

X_4(out+0, 32, p->ws[0]); 

X_4(out+128, 32, p->ws[0]); 

X_4(out+192, 32, p->ws[0]); 

X_8(out+O, 128, p->ws[2]); 


is 
void sfft_dcf8192_shl16_8 5(sfft_plan_t *p, SFFT_D *o 
ut) { 

X_8(out+O, 64, p->ws[1]); 

X_8(out+128, 64, p->ws[1]); 


} 
void sfft_dcf8192_shl16_8 9(sfft_plan_t *p, SFFT_D *o 
ut) { 
X_8(out+O, 64, p->ws[1]); 
X_4(out+128, 32, p->ws[0]); 
X_4(out+192, 32, p->ws[0]); 
sfft_dcf8192_shl16_8 5(p, out+256); 


X_8(out+0, 256, p->ws[3]); 


s; 
void sfft_dcf8192_shli16_8_13(sfft_plan_t 


out) { 


sfft_dcf8192 shli6_8 4(p, out+0); 

sfft_dcf8192_shl16_8 5(p, out+256); 
sfft_dcf8192_ shli6_8 4(p, out+512); 
sfft_dcf8192_shl16_8 4(p, out+768); 


X_8(out+0, 512, p->ws[4]); 


} 
void sfft_dcf8192_shli6_8_ 14(sfft_plan_t 


out) { 


sfft_dcf8192_ shli6_8 9(p, out+0); 
sfft_dcf8192 shli6_8 9(p, out+512); 


a 
void sfft_dcf8192_shli6_8_18(sfft_plan_t 


out) { 


sfft_dcf8192 shli6_8 9(p, out+0); 
sfft_dcf8192_ shli6_8 4(p, out+512); 
sfft_dcf8192_ shli6_8 4(p, out+768); 


sfft_dcf8192_shl116_8 14(p, 
X_8(out+0, 1024, p->ws[5]); 


s; 
void sfft_dcf8192_shli16_8_ 22(sfft_plan_t 


out) { 
sfft_dcf8192_shl116_8 13(p, 
sfft_dcf8192_shl116_8 14(p, 
sfft_dcf8192_shl16_8 13(p, 
sfft_dcf8192_ shl116_8 13(p, 
X_8(outt+0, 2048, p->ws[6]); 


out+1024); 


out+O); 

out+1024); 
out+2048) ; 
out+3072); 


SFFT_D 


SFFT_D 


SFFT_D 


SFFT_D 


* 


* 


* 


* 


} 
void sfft_dcf8192_shli6_8_i1(sfft_plan_t *p, SFFT_D *o 


ut) { 
sfft_dcf8192_sh116_8_22(p, 


sfft_dcf8192_shl1i6_8 18(p, 
sfft_dcf8192_ shl16_8 18(p, 
sfft_dcf8192_ shl16_8 22(p, 
sfft_dcf8192_shl16_8 22(p, 
X_8(out+0, 8192, p->ws[8]); 


out+O); 

out+4096); 
out+6144); 
out+8192); 
outt+12288) ; 


i 
Optimized body sub-transforms for size-8192 FFT 


Example 


A size-8192 hard-coded leaf FFT requires 229 calls to radix-2/4 and size-8 
body sub-transforms. After optimizing the sequence of calls with Sequitur, the 
compact set of functions shown in [link] replaces a sequence of 229 calls. 


Compared to the full list of statically elaborated calls, the optimized set of 
functions requires less code space while achieving better performance; and 
compared to a recursive program, the optimized set of function calls is faster 
(due to lower call and stack overhead) while trading off an acceptably small 
amount of code space. 


Scalability 


The technique presented in this section has been verified for transforms ranging 
in size from 2° through to 27° (32 mega) points. The technique works well up 
until sizes of about 2'® points, but for larger transforms the elaboration and 
compile times begin to exceed 1 second or so, and the code size again begins to 
grow large. For transforms larger than 21° points, a recursive program can be 
used until leaves of size 2'® points are reached, at which point the technique 
presented in this section is used. 


Other vector lengths 


The method of vectorizing the hard-coded leaf FFT is similar to that of the 
hard-coded FFT in "Other vector lengths"; the only difference here is the level 
of scale. 


The hard-coded FFT was vectorized by collecting together primitive leaf 
operations that loaded data from adjacent memory locations. The hard-coded 
leaf FFT has already been sorted such that consecutive leaf sub-transforms load 


leaves"), so the task is easier in this case — at least in one respect. 


Size Input array addresses 
16 {0, 64, 32, 96, 16, 80, 112, 48, 8, 72, 40, 104, 120, 56, 24, 88} 
16 {1, 65, 33, 97, 17, 81, 113, 49, 9, 73, 41, 105, 121, 57, 25, 89} 
16 {2, 66, 34, 98, 18, 82, 114, 50, 10, 74, 42, 106, 122, 58, 26, 
90} 
{3, 67, 35, 99, 19, 83, 115, 51, 123, 59, 27, 91, 11, 75, 107, 
8(x2) 43} 
{4, 68, 36, 100, 20, 84, 116, 52, 124, 60, 28, 92, 12, 76, 108, 
8(x2) 44} 
49, 09,37; 101,21, 85,117,593; 125,61, 29; 93,13, 77,109, 
8(x2) 45} 
16 {126, 62, 30, 94, 14, 78, 110, 46, 6, 70, 38, 102, 118, 54, 22, 
86} 
16 {127, 63, 31, 95, 15, 79, 111, 47, 7, 71, 39, 103, 119, 55, 23, 


87} 
Sorted size-16 leaf nodes in size-128 hard-coded leaf FFT, grouped for VL-2 


[link] shows the sorted size-16 leaf sub-transforms for a size-128 transform 
with the rows divided into VL-2 groups. Because each group of two leaf sub- 
transforms loads data from adjacent memory locations, the group of sub- 
transforms can be loaded in parallel with vector memory operations, and all (or 
some) of the computation done in parallel. The first, third and fourth groups in 


[link] contain leaf nodes of the same size/type; these are the easiest vector leaf 
sub-transforms to compute, as described in "Homogeneous leaf sub-transform 
vectors". The second group of rows contains leaf sub-transforms of differing 
size/type, and computing these sub-transforms is covered separately in 
"Heterogeneous leaf sub-transform vectors". 


Homogeneous leaf sub-transform vectors 


void 
sfft_fcf128_shli6_8 ee(offset_t *1s,const SFFT_D *in, 
SFFT_D *out){ 
SFFT_R rO,ri,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r4 
3,114,115; 
L_4(in+is[0],int+is[1],1int+is[2],in+is[3],&r0,&r1,&r2 
,&03); 
L_2(int+is[4],1in+is[5],int+is[6],int+is[7],&r4,&r5, &r6 
,&r7); 
K_O0(&r0,&r2,&r4,&r6); 
K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), 
VLIT4(0.7071, -@.7071, 0.7071, -@.7071), 
&r1,&r3,&r5,&r7); 
L_4(int+is[8],in+is[9],int+is[10],int+is[11],&r8,&r9,& 
r10,&r11); 
L_4(in+is[12],in+is[13],in+is[14],in+is[15],&r12,&r 
13,&r14,&r15); 
K_O0(&r0,&r4,&r8,&r1i2); 
K_N(VLIT4(0.9239,0.9239,0.9239,0.9239), 
VLIT4(0.3827, -0.3827,0.3827, -0.3827), 
&r1,&r5,&r9,&r13); 
TX2(&r0,&r1); TX2(&r4,&r5); TX2(&r8,&r9); TX2(&r1i2, 
&r13); 
S_4(r0,r4,r8, r12, out0+0, outO+8, OUtO+16, OULTO+24) ; 
S_4(r1,r5,r9,r13, outi+0, out1+8, out1+16, outi+24) ; 
K_N(VLIT4(0.7071,0.7071,0.7071,0.7071), 
VLIT4(0.7071, -@.7071,0.7071, -@.7071), 
&r2,&r6,&r10, &r14); 
K_N(VLIT4(0.3827,0.3827,0.3827,0.3827), 
VLIT4(0.9239, -0.9239,0.9239, -0.9239), 


&r3,&r7,&r11, &r15); 

TX2(&r2,&r3); TX2(&r6,&r7); TX2(&r10,&r11); TX2(&r1 
4,&r15); 

S_4(r2,r6,r10, r14, out0+4, out0+12, out0+20, out0+28); 

S_4(r3,r7,r11,r15, outi+4, outi+12, out1+20, out1+28); 
t 
Homogeneous size-16 leaf sub-transform for VL-2 size- 
128 hard-coded leaf FFT 


The vector leaf sub-transforms of a single size/type are handled in the same 
way as a VL-1 sub-transform, with one difference: the vector registers must be 
transposed before the data is stored to memory in the output array. In the 
example shown in [link], the transposes take place at lines 16 and 25. 


Prior to the store operations, each position of the vector register (each position 
being a whole complex word) contains an element belonging to each of the leaf 
sub-transforms composing the vectorized sub-transform. Because each leaf sub- 
transform is stored sequentially to different locations in memory with aligned 
vector store operations, sets of registers are transposed such that each vector 
register contains elements from only one leaf sub-transform. 


Heterogeneous leaf sub-transform vectors 


void 
sfft_fcf128_shli6_8 eo(offset_t *1s,const SFFT_D *in, 
SFFT_D *out){ 

SFFT_R r0_1,r2_3,r4_5,r6_7,r8_9,r10_11, r12_13,r14_1 
So, 


r16_17,r18_19,r20_21, r22_23,r24_25, r26_27, r28_29, r30_ 
oy 
L_4 4(int+is[0],int+is[1],in+is[2],intis[3], 
&rQ_1,&r2_3,&r16_17, &r18_19); 
L_2 2(int+is[4],in+is[5],in+is[6],intis[7], 
&r4_5,&r6_7,&r20_21, &r22_ 23); 
K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0. 7071, -0.7071,0 
' -O), 
&rO_1,&r2_3,&r4_5,&r6_7); 


L_4 2(intis[8],1in+is[9],in+is[10],int+is[11], 
&r8_9,&r10_11, &r28_29,&r30_31); 
L_4 4(int+is[12],int+is[13],intis[14],intis[15], 
&r12_13,&r14_15,&r24_25,&r26_27); 
K_N(VLIT4(0.9239,0.9239,1,1),VLIT4(0.3827, -0.3827,0 
' -Q), 
&rQ_1,&r4_5,&r8_9,&r12_13); 
S_4(r0_1, r4_5, r8_9, r12_13, out0+0, outO0+8, outO0+16, out 
O+24); 

K_N(VLIT4(0.3827,0.3827,0.7071,0.7071), 
VLIT4(0.9239, -0.9239,0.7071, -0.7071), 
&r2_3,&r6_7,&r10_11, &r14_15); 

S_4(r2_3, r6_7, r10_11, r14_15, out0+4, out0+12, out0+20, 

out0+28); 

K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0. 7071, -0.7071,0 

' -O), 
&r16_17,&r18_19,&r20_21,&r22_23); 

S_4(r16_17, r18_19, r20_21, r22_23, out1+0, outi+4, outi+ 

8,outi+12); 

K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0. 7071, -0.7071,0 

! -Q), 
&r24_25,&r26_27,&r28_29, &r30_31); 

S_4(r24_25, r26_27, r28_29, r30_31, out1+16, out1+20, out 

1+24,outi+28) ; 

} 

Heterogeneous size-16 leaf sub-transform for VL-2 
size-128 hard-coded leaf FFT 


In the case of a vector comprising heterogeneous leaf sub-transforms, the data 
is transposed into separate sub-transforms following the primitive leaf 
operations. The remainder of the computation is carried out separately for each 
leaf sub-transform in the vector, and no further transposes are required. 


When elaborating and generating code for VL-2 transforms, there are only two 
heterogeneous leaf sub-transforms that might be required, but for other vector 
lengths the combinations are more complex. During the elaboration process, 
each unique combination that is encountered in the sorted list of leaf sub- 
transforms is elaborated into a function with repeated calls to the elaborate 


function, as was done in "Vector length 1" in order to elaborate a sub-transform 
composed of two size Nieaf /2 sub-transforms. 


[link] is an example of a heterogeneous size-16 VL-2 leaf sub-transform, where 
one size-16 leaf sub-transform is loaded into the lower halves of the vector 
registers, and the data from another leaf sub-transform composed of two size-8 
sub-transforms is loaded into the upper halves. The primitive leaf operations at 
lines 5, 7, 11 and 13 transpose each sub-transform's data into separate vector 
registers, and the remainder of the computation is performed on each sub- 
transform separately. The size-16 sub-transform is stored to sequential locations 
in memory at lines 17 and 21, while the sub-transform composed of two size-8 
leaf sub-transforms is stored to memory at lines 24 and 27. 


Streaming stores 


Some machines support streaming store or non-temporal store instructions; 
these instructions are used to store data to locations that do not have temporal 
locality, and thus the cache can be bypassed. The hard-coded leaf FFT 
described in the previous sections splits the computation into a pass of leaf sub- 
transforms and several passes of body sub-transforms. For large transforms 
where the size of the data exceeds the outermost level of cache, the non- 
temporal store instructions can be used in the leaf sub-transforms to bypass the 
cache when storing data to the output array; this can greatly improve 
performance by keeping other data in cache. The Intel SSE and AVX vector 
extensions both support streaming stores. 


Performance 
Single- Double- Single- Double- 
precision, precision, SSE precision, precision, AVX 


SSE (VL-2) (VL-1) AVX (VL-4) (VL-2) 


: 
> ’ : ace es Po oo 
PEGE LEE OPEL OE EP PE EES SS i ee ee a ee 


Performance of hard-coded leaf FFTs on a Macbook Air 4,2. 


[link] shows the results of a benchmark for transforms of size 256 through to 
262,144 running on a Macbook Air 4,2. The speed of FFTW 3.3 running in 
estimate and patient modes is also shown for comparison. 


For each size of transform, precision and vector length (i.e., either SSE or 
AVX), several configurations of hard-coded leaf FFT were generated: three 
configurations of leaf size (16, 32 and 64), and if the transform was larger than 
32,768, an additional transform with size-16 leaves and streaming store 
instructions was also generated. Before running the benchmark, the library was 
calibrated and the fastest configuration selected (details of the calibration are 
described in Calibration"). 


For most sizes of transform, precision and vector length, SFFT is faster than 
FFTW running in patient mode. For the transforms with memory requirements 
that are approximately at the limits of the cache, FFTW running in patient 
mode is sometimes marginally faster than SFFT. Once the transforms exceed 
the size of the cache, SFFT is again the fastest. 


It is important to note that FFTW running in patient mode evaluates a huge 
configuration space of parameters (and thus takes a long time to calibrate), 
while SFFT has, in this case, only evaluated either three or four configurations 
per transform. 


In practice 


SFFT is not itself an FFT library; the name refers to the elaboration program 
that reads a configuration file and generates the code for an FFT library. The 
code for the FFT library is then built as any other library would be. 


Organization 


As well as the generated code, there is infrastructure code which is common to 
all libraries generated by SFFT. This can be broadly categorized into three 
parts: initialization, dispatch and calibration. 


Initialization 


Before an application can compute an FFT with SFFT, it must initialize a plan 
for the specific size, precision and direction of FFT. The library may have 
several FFTs and configurations that can compute the requested FFT, and it 
chooses the fastest option by timing each of the candidate configurations, 
which is at most 8 for any size of transform — a very small space compared to 
FFTW's exhaustive search of all possible FFT algorithms and configurations. 
Results and discussion describes an alternative to calibration, where machine 
learning is used with data collected from benchmarks to build a model that 
predicts performance. 


After determining which implementation and parameters will be used, the 
initialization code allocates memory and populates any lookup tables that may 
be required. Before returning the plan to the application, a function pointer in 
the plan is updated to point to the FFT that has just been initialized. 


Dispatch 


Applications do not invoke any of the FFTs within SFFT directly. Rather they 
invoke a dispatch function on an initialized plan, which in turn transfers control 
to the correct FFT code within SFFT. The use of a dispatch function is purely a 
matter of convenience, so that users only need to deal with a few simple 
functions. 


Calibration 


SFFT contains calibration code to measure the performance of the possible 
configurations of FFT on the target machine, which is at most 8 for each size of 


transform. Following calibration, the timing data is written to a file, which is 
then used by SFFT to select the fastest possible FFT for a given problem 
running on that machine. 


Usage 
SFFT is used much like other FFT libraries: 


1. A plan for an FFT is initialized; 

2. Using the plan, an FFT is computed (this step may be repeated many 
times); 

3. The plan is destroyed. 


The plan is initialized for a given size, precision and direction of transform, and 
may then be executed any number of times on any data. Any number of plans 
can be simultaneously created and used. 


int n = 1024; 
double complex __attribute__ ((aligned(32))) *input 
, “output; 
input = _mm_malloc(n * sizeof(double complex), 32); 
output = _mm_malloc(n * sizeof(double complex), 32) 


for(i=0;i<nj;it+) input[i] = i; 


sfft_plan_t *p = sfft_init(i, SFFT_FORWARD|SFFT_DOU 
BLE|SFFT_AVX); 


if(p) { 


sfft_execute(p, input, output); 
for(1=0;1<n;it++) 
printf("%d %*Ff %f\n", 1, creal(output[1i]), cimag 
(output[i])); 
sfft_free(p); 


telsef{ 


printf("Plan unsupported\n"); 
SFFT example usage 


In [link], a size-1024 transform is computed on double-precision data with 
AVX enabled. In lines 2-4, the input and output arrays are allocated with 32 
byte alignment, as is required for aligned AVX memory operations. The plan is 
initialized at line 8, used to compute an FFT at line 12 (provided the requested 
plan is supported), and finally freed at line 20. 


Other optimizations 


In addition to generating a general-purpose library that can be calibrated for a 
machine and application at runtime, there are several situations where the SFFT 
library can be specially optimized: 


1. If the machine and application are fixed, a one time calibration can be 
performed and an optimized library containing only the fastest transforms 
specific to the application and machine is generated; 

2. If the application is fixed, an optimized library containing only the 
transforms specific to the application is generated (and the library is 
calibrated the first time it is used on each machine); 

3. If the machine is fixed, an optimized library containing only the 
transforms specific to the machine is generated (and an application can use 
any transform without calibration). 


Benchmark Methods 


This chapter describes the benchmarking methods used to evaluate the 
performance and accuracy of various FFT implementations throughout this 
thesis. 


The two architectures of interest are the Intel x86 architecture and the ARM 
architecture. A comprehensive set of results collected from a wide range of 
machines implementing these architectures is presented in Results and 
discussion, but throughout the rest of the thesis, benchmarks are performed 
on an Apple Macbook Air 4,2; a widely available and currently state-of-the- 
art machine that is equipped with an Intel Core i5-2557M. [link] 
summarizes the specifications of the machine. 


For the x86 benchmarks, an existing framework called BenchFFT [link] 
was used. For the ARM benchmarks, which were performed on iOS 
devices, there was no existing FFT benchmark software, and so an 
application was written for this purpose, which is described in "ARM 
architecture". 


Macbook Air 4,2 
CPU Dual-core Intel Core i5 (i5-2557M) 
CPU clock 1.7 GHz (turbo to 2.7GHz with one core) 
L1 cache 32KB I-cache & 32KB D-cache 
L2 cache 256KB 
L3 cache 3MB shared 


Memory 4 GB of 1333 MHz DDR3 SDRAM 


OS OS X 10.7.2 
SIMD extensions SSE and AVX 


Specifications of the primary test machine 


x86 architecture 


The x86 benchmarks were performed with BenchFFT, a collection of FFT 
libraries and benchmarking software assembled by Frigo and Johnson, the 
authors of FFTW [link]. The benchmarks in BenchFFT use timing and 
calibration code from Imbench, a performance analysis tool written by 
Larry McVoy and Carl Staelin [link]. 


Timing 


BenchFFT measures the initialization time and runtime of an FFT 
separately. The initialization time is measured only once, and thus outliers 
due to effects from external factors such as OS scheduling are occasionally 
observed. Routines from Lmbench are then used to calibrate the minimum 
number of FFT iterations required for accurate measurement using the 
gettimeofday function. Finally, the time taken to run the minimum 
number of iterations is measured eight times, from which the minimum 
time divided by the number of iterations is used, in order to factor out 
effects from external factors. 


The minimum time for a transform is then used to determine a scaled 
inverse time measurement, sometimes known as CTGs. CTG are defined 
as: 

Equation: 


5N log, (NV) 


CTGs = 
10°%t 


for complex transforms and 
Equation: 


2.5.N log, (V) 


CTGs = 
10°t 


for real transforms, where ¢ is the time taken to run one transform (in 
seconds). Unless the Cooley-Tukey radix-2 algorithm is used, a 
measurement expressed in CTGs is not an actual FLOP count — it is a rough 
measure of an algorithm's efficiency relative to the radix-2 algorithm and 
the clock speed of the machine. 


When a transform has several variants (such as direction or radix), 
BenchFFT reports the speed of the FFT as being the fastest of the possible 
options. 


Accuracy 


To measure the accuracy of a transform, BenchFFT compares an FFT with 
an arbitrary-precision FFT computed on the same inputs, and reports the 
relative RMS error. The inputs are pseudo-random in the range [0. 5, 0. 5) 
and the arbitrary-precision FFT has over 40 decimal places of accuracy. 


When a transform has several variants (such as direction or radix), 
BenchFFT reports the accuracy as being worst of the results. 


Compiling 


Except where otherwise noted, ICC version 12.1.0 for OS X was used to 
compile 64-bit code. For OS X builds, the compiler flags used were “-O3”, 
while “-O3 -msse2” (or equivalent) was used for Linux builds. In the cases 
where the FFT uses AVX, the code is compiled with “-xAV X” or “-mavx” 
(depending on compiler). 


Some libraries included in the BenchFFT software have their own 
compilation scripts which override the defaults, and in the case of 
commercial libraries (such as Intel IPP and Apple vDSP), the compiler 
flags are of little consequence because the libraries are distributed in binary 
form. 


Data format 


FFT libraries use interleaved format and/or complex format to store the 
data. In the case of interleaved format, the real and imaginary parts of 
complex numbers are stored adjacently in memory, while in the case of split 
format, the real and imaginary parts are stored in separate arrays. 


The majority of FFT libraries use interleaved format to store data. In the 
case where the library supports interleaved or split format, BenchFFT uses 
interleaved format. However there are a few libraries that only support split 
format, and in theses cases it should be noted the results are not strictly 
comparable (Apple vDSP is one such case). 


ARM architecture 
There was no existing FFT benchmarking software for iOS on ARM 


devices, and so a benchmarking tool was written. The tool runs the 
benchmarking in a thread of normal priority. 


Compiling 


The code was compiled with Apple clang compiler 3.0 for ARMv7 targets 
running iOS 5.0. The compiler flags used were “-O3 -mfpu=neon”. 


Timing 


The Apple A4 and A5 SoCs are built around the ARM Cortex-A8 and 
Cortex-A9 cores, which have hardware cycle counters that can be used for 
precise timing. The cycle counter control registers can only be accessed in 
kernel mode, and so the high resolution timer available through the 
mach_absolute_time function was used instead. 


For a given size of transform, a calibration routine determines the number 
of iterations that must be run such that the total runtime is approximately 
one second. After calibration, each FFT to be evaluated is run for the pre- 
determined number of iterations — this loop is run eight times, and the 
fastest time divided by the number of iterations is taken to be the FFTs 
runtime. By running each FFT for approximately one second, and repeating 
the measurement eight times to find the best time, the effects from external 
factors such as OS scheduling are minimized. As with BenchFFT, the time 
is expressed in CTGs. 


Results and Discussion 


In order to test the hypotheses set out in Introduction, SFFT was 
benchmarked alongside FFTW and other libraries on a wide range of 
machines, as per the methods set out in Benchmark methods. The majority 
of the data was collected on Linux machines populated with SSE capable 
Intel microprocessors, with some additional data collected on small set of 
AVX and ARM NEON machines. The results are divided into three 
sections: speed, accuracy and setup time, with an additional section 
detailing a model that predicts SFFT's performance for different 
configurations. Finally, the chapter concludes by relating the results to other 
work. 


Modelstring Lid [2 L3 
Intel(R) Pentium(R) 4 CPU 2.80GHz 16 ol2 - 
Intel(R) Pentium(R) D CPU 3.00GHz 16 1024 - 
Intel(R) Pentium(R) M processor ; 
1000MHz Ss mes 
Intel(R) Xeon(TM) CPU 2.40GHz 16 2048 - 
Intel(R) Xeon(R) CPU E5335 @ ; 
2.00GHz = “a 
Intel(R) Xeon(R) CPU X5355 @ - 
2.66GHz ae Ones 
Intel(R) Xeon(R) CPU E5430 @ 39 6144 ; 


2.66GHz 


Intel(R) Xeon(R) CPU X5560 @ 32 256 8192 
2.80GHz 


Intel(R) Core(TM)2 CPU 6600 @ 


2.40GHz can] a | 
oo Quad CPU Q6600 30 A096 
aa Duo CPU E6850 30 A096 _ 
a Duo CPU E8400 30 6144 
ee aaa Duo CPU P8600 30 3072 _ 
en is CPU 660 @ 39 256 4096 
Intel(R) Core(TM) i7-2600 CPU @ 32 256 8192 


3.40GHz 


Linux benchmark machines, listed with the size of each level of cache (in 
kilobytes) 


[link] presents a summary of the Linux machines that were used to run 
benchmarks. The majority of the machines were functioning as either lab 
workstations or servers in a University environment. The benchmarks took 
approximately 12 hours to run, and while efforts were made to reduce each 
machine's load to a minimum, there were still transient system processes, 
such as log rotations and backups during the night that have introduced 
noise into the results. 


For the Linux benchmarks, both 32-bit and 64-bit statically-linked binaries 
for SFFT, FFTW 3.3 and SPIRAL were compiled with icc 12.0.5, gcc 4.4.5 


AVX and ARM NEON. 


2 


Accelerate. Because these libraries are only available in binary form, they 
are compared against the icc builds of SFFT, FFTW 3.3 and SPIRAL, 


4.2.1 and clang 3.0. The builds of SFFT and FFTW 3.3.1 for iOS 5 on 
because icc generally produced the fastest code. 


ARM NEON were compiled with Apple clang 3.0. 
Several binary libraries were also benchmarked: Intel IPP 7 and Apple 


The speed results are presented in subsections according to the SIMD 


and clang 1.1. For the OS X benchmarks, 32-bit and 64-bit binaries for 
SFFT, FFTW 3.3 and SPIRAL were compiled with icc 12.1.0, Ilvm-gcc 
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Performance comparison between SFFT and SPIRAL on SSE 
machines. Although SPIRAL is faster when compiled with clang 
1.1, [link] shows that SFFT is faster than SPIRAL when compiled 
with clang 3.0 


[link] summarizes the speed performance of SFFT against FFTW 3.3 
running in estimate mode on Linux machines with SSE. Twelve heatmaps 


are used to present data from different configurations. The three rows in the 


grid correspond to the three different compilers used, while the four 
columns correspond to the four different architecture and floating-point 
precision pairs. Within each heatmap, the rows correspond to different 
machines, and the columns correspond to different sizes of transform (2! 
through to 218). Shades of green indicate that SFFT is faster for a particular 
point of data, while shades of yellow through to red indicate that FFTW is 
faster; lighter shades indicate a small difference, while darker shades 


indicate a bigger difference in performance. The scale for the colour map is 
computed separately for each of the 12 heatmaps in the grid, so a particular 


colour in one heatmap is not directly comparable to the same colour in 


another heatmap; the colours are only meant to indicate differences within 
each heatmap. 


Similarily, [link] compares SFFT to FFTW 3.3 running in patient mode, and 
[link] compares SFFT to SPIRAL. There are fewer columns in the 
heatmaps of [link] because SPIRAL only computes single-threaded FFTs 
for sizes 2! through to 2!°. 


FFTW 3.3 in estimate mode 


[link] shows that SFFT is faster than FFTW 3.3 running in estimate mode in 
almost all cases over a range of Intel x86 machines that implement SSE. 
The horizontal streaks of yellow-red that can be seen in some heatmaps are 
outliers and likely caused by transient system processes that were running 
while SFFT was being benchmarked. Similar streaks appear at the same 
locations in Figures [link] and [link]. 


FFTW 3.3 in patient mode 


Core i7- Core 17- Core i5- Core i5- 
2600, single- 2600, 2557M, 2557M, 
precision double- single- double- 
precision precision precision 


a bY 
a 6 

. ! 
Terre Kee Sees Se P Se | 
ae : 
: 


: 

: 

Se Pe ESSE SET POE Ls 
aes 7 * 


soe de> oo Cy 2 o gary 
a SFE PETRA SE 


Performance of FFTs on recent Sandy Bridge machines, with x86_64 


SSE binaries. Compiler: icc 


[link] shows that SFFT is faster than FFTW 3.3 running in patient mode in 
the majority of cases over a range of Intel x86 machines that implement 
SSE. SFFT was generally slightly slower than fftw3-patient on older 
machines such as the Pentium 4's and the 1GHz Pentium M, while on the 
newer machines such as the Sandy Bridge based Core i7-2600 and the 
Nehalem based Core i5-660, SFFT was clearly faster than FFTW (see 
[link]). This could be explained by the fact that FFTW performs extensive 
instruction level optimizations, such as scheduling, and that the older 
processors have smaller instruction and trace caches. 


SPIRAL 
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Performance of clang-compiled x86_64 SSE FFTs on an Intel Core2 
Duo P8600 


The last row of [link] shows that SFFT is generally slower than SPIRAL 
when both libraries are compiled with clang 1.1. However, with more recent 
releases of clang, which do much more code optimization, the situation is 
reversed, as shown in [link]. In some cases SPIRAL compiled with clang 
3.0 is slower than SPIRAL compiled with clang 1.1, while SFFT is 
generally faster when compiled with clang 3.0. This demonstrates that the 
speed of automatically tuned SPIRAL code is specific to certain compilers. 


SPIRAL's double-precision performance is slightly better than SFFT when 
compiled with icc or gcc, while SFFT's single-precision code is faster than 
SPIRAL on recent machines, and of similar speed on older machines. 


AVX 


Of the machines that were used for benchmarks, only two supported AVX: 
the Macbook Air 4,2 with an Intel Core i5-2557M, and a Linux machine 
with an Intel Core i7-2600. [link] shows that SFFT is clearly faster than 
FFTW up until about 1024 points, while performance between the two is 
similar for larger transforms. 


Results for Intel IPP are also plotted in [link], but only for the Core i7-2600. 
IPP did not detect the existence of AVX on the Core i5-2557M, and instead 
used SSE, as plotted in [link]. Apple vDSP does not support AVX, and so 
SSE vDSP results for the Macbook Air 4,2's Core i5-2557M are also 
plotted in [link]. 
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Performance of FFTs on recent Sandy Bridge machines, with x86_64 
AVX binaries. Compiler: icc 
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Performance of single-precision FFTs on ARM NEON devices 
running iOS. Compiler: Apple clang 3.0 


SFFT and FFTW 3.3.1 were compiled with Apple clang 3.0 and 
benchmarked on an Apple iPod touch 4G and an Apple iPad 2, which 
contain the Apple A4 and A5 SoCs respectively. The A4 implements the 


ARM Cortex-A8, while the A5 implements the ARM Cortex-A9, both of 
which support ARM NEON. 


[link] shows that SFFT is easily faster than FFTW on both devices. This 
contradicts Frigo and Johnson's claim that the performance of FFTW is 
portable, and tends to support the idea that it is possible to write fast and 
portable code without exhaustive searches through the configuration space 
of all possible FFTs. 


A considerable amount of effort was needed to work around several 
problems that were encountered when targeting ARM NEON with Apple 
clang 3.0, and many of SFFT's primitive macros for NEON were written in 
inline assembly code. Among the problems encountered when targeting 
ARM NEON with Apple clang 3.0: 


1. There is no way of explicitly specifying memory alignment when 
using vector intrinsics; 

2. Fused multiply-add/subtract intrinsics do not currently compile to the 
correct instructions because of a bug in clang; 

3. Clang's inline assembly front-end lacks the syntax and semantics to 
properly address the dual-size aliased vector registers. 


The above problems affect all FFT libraries equally, and it seems that 
portability depends critically on the quality of the machine specific code 
and macros. 
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Accuracy of FFTs on an Intel Core i7-2600. SFFT, 
FFTW and SPIRAL were compiled for x86_64 with 
icc 


The accuracy of each FFT was measured as per the methods in Benchmark 
methods. The accuracy of single and double precision FFTs on an Intel 
Core i7-2600 is plotted in [link], and shows that the relative RMS error for 
FFTW, SFFT and SPIRAL is within an acceptable range. Graphs for all 
other machines are similar. 
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Setup times of FFTs on an Intel Core i7-2600. SFFT, 
FFTW and SPIRAL were compiled for x86_64 with icc 


[link] shows that FFTW, in patient mode, requires several orders of 
magnitude more time to initialize as it searches for a fast FFT configuration. 
SPIRAL has a very fast setup time, because it is entirely statically 
elaborated and needs no dynamic initialization. The setup time for SFFT is 
comparable to FFTW in estimate mode, though SFFT's setup time begins to 
increase for transforms larger than 8192 points. This is likely because of 
repeated calls to the complex exponential function as twiddle factor LUTs 
are elaborated; no effort was made to optimize this setup code, and it is 
likely that it would be much faster if the calls to the complex exponential 
function were optimized. 


Graphs for all other machines are similar. 


Binary size 


Compared to other libraries, SFFT produced larger binaries for the 
benchmarks, because there is currently no optimization performed between 
transforms contained in the same library. For 64-bit single precision 
binaries on OS X with AVX, the size of the SFFT benchmark was 
approximately 2.8 megabytes while the size of the FFTW benchmark was 
1.8 megabytes. 


Predicting performance 


For each size of transform on a particular machine, SFFT chooses the 
fastest configuration from a set of up to eight possible configurations. Small 
transforms have only one option, which is a fully hard-coded transform, 
while larger transforms have up to eight, which could include the four-step 
transform, and several variants of the hard-coded leaf transform, where 
each variant corresponds to a particular size of leaf sub-transform and size 
of body sub-transform, and for size-16 leaf sub-transforms, a streaming 


store variant is included too. The decision of exactly which configuration to 
use depends on the size of transform, the compiler, and the characteristics 
of the host machine. 


For the benchmarks in this chapter, SFFT used a calibration routine to 
choose the fastest configuration. The calibration data was collected, along 
with some data about the machine and the compiler, and used to train a 
classifier. 


The data was processed into instances, with each instance having attributes 
for the size of the transform and the precision, the size of each level of 
cache, the architecture and micro-architecture of the machine, the SIMD 
extensions, the OS, the compiler used, and the CPU frequency. In total there 
were 3348 instances of data, each of which had 12 attributes. 


Weka [Link] was used to experiment with several classifiers, and a REPTree 
classifier with bagging was used to train a model. Using 10-fold cross- 
validation, the model correctly classified 76.1% of the instances with a 
weighted average precision of 74.8%, which tends to confirm the existence 
of a relationship between the characteristics of the machine and the 
performance of a particular FFT configuration. 


The accuracy of the classifier is promising, and it has the potential to 
replace the calibration code in SFFT. It is highly likely that if the noise in 
the data was reduced through the use of an isolated benchmarking 
environment, the accuracy of the classifier would increase. The accuracy 
would also likely benefit from a larger dataset collected from a larger range 
of benchmark machines. 


Split-radix vs. conjugate-pair 
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Ordinary split-radix versus conjugate-pair split-radix on 
an Intel Core i5-2557M. SFFT, FFTW and SPIRAL 
were compiled for x86_64 with icc 


In order to quantify the gain in performance that might be attributable to the 
use of the conjugate-pair algorithm, SFFT was retrospectively modified to 
compute the FFT using the ordinary split-radix algorithm as well as the 
conjugate-pair algorithm. The results of benchmarks between the two 
algorithms, as well as FFTW and SPIRAL, are plotted in [Link]. 


Unexpectedly, the ordinary split-radix algorithm is faster than the 
conjugate-pair algorithm for some smaller sizes of transform, but for 
transforms above a certain size, the conjugate-pair algorithm is faster by a 
few hundred MFLOPS. 


The performance advantage of the ordinary split-radix algorithm for smaller 
sizes of transforms is likely due to shorter chains of dependent instructions 
where twiddle factors are loaded and used. Consider that the ordinary split- 
radix algorithm separately loads two twiddle factors into two registers, and 
there are no dependencies between these instructions, while the conjugate- 
pair algorithm must load one twiddle factor and then duplicate it into 
another register, which does result in dependent instructions. Thus the 
ordinary split-radix algorithm is faster for smaller transforms where 
memory bandwidth is not the limiting factor, but when memory bandwidth 
does become the limiting factor, the conjugate-pair algorithm is faster. 


In future, SFFT could exploit the performance advantage of the ordinary 
split-radix algorithm when computing smaller sizes of transforms. 


Applications of this work 


This section provides an overview of how the techniques presented in this 
thesis may be applied to the prime-factor algorithm, sparse Fourier 
transforms, and multi-threaded transforms. 


Prime-factor algorithm 


The techniques presented in this work rely on the fact that FFTs operating 
on signal lengths that are a power-of-two can be factored into smaller 
power-of-two length components, which are computed in parallel by being 
evenly divided into a number of SIMD vector registers that are a power-of- 
two length. 


The prime-factor algorithm factors other lengths of FFTs into components 
that are co-prime in length, and ultimately small prime components, which 
do not evenly divide into the power-of-two length SIMD registers, except in 
the special case where a SIMD register contains only one complex element 
(such is the case with double-precision on SSE machines). 


Because the prime components do not evenly divide into power-of-two 
length SIMD registers, the algorithm level vectorization techniques 
presented in this work are not directly applicable. In contrast, the auto- 
vectorization techniques used in SPIRAL [link], [link], [link] are performed 
at the instruction level, and are applicable to the prime-factor algorithm, but 
as the results in [link] show, the downside of SPIRAL's lower level 
approach is that performance for power-of-two transforms scales poorly 
with the length of the SIMD register. 


Sparse Fourier transforms 


The recently published Sparse FFT [link], [link] will benefit from the 
techniques presented in this work because the inner loops use small DFTs 
(e.g, 512 point for a certain 256k point sparse FFT), which are currently 
computed with FFTW. Replacing FFTW with SFFT will almost certainly 
result in improved performance, because SFFT is faster than both FFTW 
and Intel IPP for the applicable small sizes of transform on an Intel Core i7- 
2600 (see [link]). 


Version 2.0 of the Sparse FFT code is scalar, and would benefit greatly from 
explicitly describing the computation with SIMD intrinsics. However, a key 
difference between the sparse Fourier transform and other FFTs is the use of 
conditional branches on the input signal data. This has performance 
implications on all machines, but it is worth noting that some machines will 
be drastically affected by this, such as the ARM Cortex-A8, where the 
SIMD pipeline is located behind the main pipeline, resulting in fast 
transfers from the main CPU unit to the SIMD pipeline, but large penalties 
when SIMD registers or flags are accessed by the main CPU unit. 


Multi-threaded transforms 
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Speed of multi-threaded four-step algorithm running on an 
Intel Core i5-2557M with four threads. The algorithm 
decomposes transforms into smaller single-threaded 
components, which are computed above with three 
different implementations. All code was compiled with icc 
for x86_64 with SSE. 


MatrixFFT has recently shown that the four-step algorithm [link], designed 
to efficiently use hierarchical or external memory on Cray machines in the 
1980's, is useful for computing large multi-threaded transforms on modern 
machines, providing performance far surpassing that of FFTW's multi- 
threaded performance [link]. 


The four-step algorithm decomposes a transform of size _into a two- 


dimensional array of size , X  9Where = 1 9,and ;= 9= V 
(or close) often obtains the best performance. 


The four-steps of the algorithm are: 


1.Compute ; FFTs of length 42 along the columns of the array; 

2. Multiply each element of the array with ,where and are the 
array coordinates; 

3. Transpose the array; 

4.Compute » FFTs of length , along the columns of the array. 


Each step can be divided amongst a pool of threads, with a synchronisation 
barrier between the third and fourth steps. The transforms in steps one and 
four operate on sequential data, and if they are small enough, they are not 
subject to bandwidth limitations (and if they are not small enough, they can 
be further decomposed with the four-step algorithm until they are small 
enough). The bandwidth bottleneck does not disappear, but it is factored out 
into the transpose in step three, and because of this, the performance of the 
small single-threaded 1D transforms used in steps one and four correlate 
with the overall multi-threaded performance. A simple multi-threaded 
implementation of the four-step algorithm was benchmarked with SFFT and 
FFTW transforms, and the results are shown in [link], which tends to 
confirm that the performance of single-threaded transforms for steps one 
and four translates to the overall multi-threaded performance when using 
the four-step algorithm. 


Similar work 


Aside from Bernstein's FFT library, which was designed in the days of 
scalar microprocessors and has not been updated since 1999, there have 
been a few other challenges to the automatically adaptive approach of 
FFTW, but none present concrete results that definitively dismiss the idea. 
Most recently, Vasilios et al. presented an approach that uses the 
characteristics of the host machine to choose good FFT parameters at run 
time [link], but their approach has several issues that render it almost 
irrelevant. First, the approach uses optimizations that only apply to scalar 
machines, viz. twiddle factor symmetries are exploited to compress the 
twiddle LUTs, and arithmetic is avoided when twiddle factors contains 
zeros or ones. The vast majority of microprocessors, even those found in 
mobile devices such as phones, feature SIMD extensions, and so an 
approach that is limited to scalar arithmetic is of little consequence. Second, 
they benchmark the FFTs in a most unusual way. Rather than repeat a large 


number of iterations of the FFT, they repeat a large number of iterations of 
a binary that initializes and then executes only one FFT; such an approach is 
by no means representative of applications where the performance of the 
FFT is a concern, and is more a measurement of the initialization time 
rather than the FFT. 


Conclusions and Future Work 


The results presented in this thesis show that vectorization at the algorithm 
level of abstraction produces good performance results, the conjugate-pair 
algorithm is in many cases faster than the ordinary split-radix algorithm, 
and that there are good heuristics for predicting the performance of the FFT 
on SIMD microprocessors (i.e., the need for empirical optimization may be 
overstated). 


This work concludes with a review of the hypotheses, a summary of the 
contributions, some ideas for directions that future work might take, and a 
few final remarks. 


Revisiting the hypotheses 


This section discusses the hypotheses of Introduction with reference to the 
experiments in Implementation details and Streaming FFT and the results in 
Results and discussion. 


Hypothesis 1: Accessing memory in sequential “streams” is critical for 
best performance 


The simple implementation in Simple programs used a LUT to store 
precomputed coefficients, but for every size of sub-transform that composes 
a particular transform, the LUT is accessed non-contiguously, with vector 
gather operations of varying strides. In Vectorized loops, vector intrinsics 
and a sequentially accessed LUT for each size of sub-transform are shown 
to improve performance. Although the set of LUTs increases the memory 
footprint, the speed improves markedly, by over 30% in many cases. 


computation was topologically sorted so that accesses to the input data, 
which are effectively pseudo-random for a decimation-in-time 
decomposition, are ordered into sequential streams. Benchmark results in 
Results and discussion show that this technique, in tandem with several 
others, achieves good results, being faster than FFTW in many cases. 


The results from the above two cases confirm the idea that accessing data in 
sequential streams provides big performance gains, even in the somewhat 
counter-intuitive case where data is duplicated and more memory is 
required. 


Hypothesis 2: The conjugate-pair algorithm is faster than the ordinary 
split-radix algorithm 


Hypothesis 2 is based on the idea that memory bandwidth is a bottleneck, 
and on the fact that the conjugate-pair algorithm requires only half the 
number of twiddle factor loads. 


In Results and discussion, a highly optimized implementation of the 
conjugate-pair algorithm is benchmarked against an equally highly 
optimized implementation of the ordinary split-radix algorithm. For smaller 
sizes of transform, the ordinary split-radix algorithm is faster, but above a 
certain size (4096 in this case), the conjugate-pair algorithm is faster. 


Thus, Hypothesis 2 is confirmed with the proviso that the transform is 
larger than a particular size. 


Hypothesis 3: The performance of an FFT can be predicted based on 
characteristics of the underlying machine and the compiler 


In Results and discussion, SFFT and FFTW were benchmarked on sixteen 
x86 machines and two ARM NEON machines, and SFFT was found to be 
as fast as, or faster than FFTW, suggesting that the performance of an FFT 
running on a certain machine can be predicted and reasoned about, and that 
extensive machine calibration might not be required. 


In Predicting performance, a model was evaluated with 10-fold cross- 
validation to have 74.8% precision when using characteristics of the 
underlying machine and the compiler to predict performance, further 
supporting the idea that the performance of the FFT on SIMD 
microprocessors can be predicted and reasoned about. 


Contributions 
The contributions of this work are summarized as follows: 


1. Three methods of computing the conjugate-pair algorithm on SIMD 
microprocessors are presented in Streaming FFT. The three techniques 
are suited for different sizes of transform, but in general, all techniques 
are amenable to algorithm level vectorization, and latency and memory 
locality optimizations. These techniques are shown to produce results 
that are, in many cases, faster than state of the art libraries such as 
FFTW and SPIRAL, but without extensive machine calibration; 

2. The source code for the library developed in this thesis, SFFT, is 
publicly available under a permissive open source license on github. A 
permissive open source license will hopefully ensure that SFFT is 
developed further. 


Future work 


This section presents some ideas for future work that can be divided into 
four categories: measurement, modelling, systems and applications. 


Measurement 


FFTW could be instrumented to collect data on the huge space of 
transforms it evaluates, which could then be used to build more accurate 
models. 


The existing FFT benchmarking infrastructure could be improved by 
detecting interruption by other system processes and re-running the affected 
results. Benchmarks could then be run on a much wider range of machines, 
under more controlled conditions, which would increase the accuracy of 
models built from the data. 


Modelling 


It might be possible to build a classifier that predicts whether a transform is 
likely, given some threshold, to be the fastest. The fastest is then selected 
from a subset of those that are likely to be the fastest, and thus the number 
of transforms that must be evaluated during calibration is reduced, while 
sacrificing little or no performance. 


Systems 


SFFT could be extended to multi-dimensional, multi-threaded, real, large 
(megapoint and above) and arbitrary sized transforms. Additionally, support 
for other architectures such as POWER and Cell B.E. could be added. Code 
could be optimized between transforms in a library, which would reduce 
binary size, but there may be other effects. 


Algorithms 


So far, there have been no known attempts to seriously optimize the tangent 
FFT, and the results of optimizing the tangent FFT to the same degree as the 
conjugate-pair FFT in this thesis would be very interesting. 


SFFT could be utilized in the sparse FFT algorithms which have recently 
been published, perhaps improving their performance even further. 


Applications 


Applications such as the SETI@home client could be patched to support 
SFFT. The results of benchmarks between SFFT, FFTW and other libraries, 
when used in real world applications such as SETI@home, would be of 
great interest. 


Final remarks 


This thesis showed that high-performance computation of the FFT is by no 
means a solved problem, and it is hoped that this work will serve as a 
catalyst or foundation for future efforts that push efficiency and 
performance even further. 


Appendix 1 - Simple FFTs 


This Appendix contains source code listings corresponding to the simple 
implementations in Implementation details. 


#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float)k / 
(float )N)) 


void ditfft2(data_t *in, data_t *out, int stride, 


int N) { 
if(N == 2) { 
out [0] = in[O] + in[stride]; 
out[N/2] = in[O] - in[stride]; 
telse{ 


ditfft2(in, out, stride << 1, N >> 1); 


ditfft2(intstride, out+N/2, stride << 1, N >> 1); 


/* k=0 - 
> no multiplication */ 

data_t Ek = out[0]; 
data_t Ok = out[N/2]; 
out[@] = Ek + Ok; 
out[N/2] = Ek - Ok; 

} 

int k; 


For(k=1;k<N/2;k++) { 
data_t Ek = out[k]; 


data_t Ok = out[(k+N/2)]; 


out [k] Ek + W(N,k) * Ok; 


out[(k+N/2) ] = Ek - W(N,k) * Ok; 
} 
} 
I; 
Simple radix-2 FFT 


#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float)k / 
(float )N)) 


void splitfft(data_t *in, data_t *out, int stride, 
int N) { 
if(N == 1) £ 
out[O] = in[0]; 
selse if(N == 2) { 
out[0] = 
out[N/2] = 


in[O] + in[stride]; 

in[O] - in[stride]; 
telse{ 

splitfft(in, out, stride << 1, N >> 1); 

splitfft(in+stride, out+N/2, stride << 2, N >> 2); 


splitfft(in+3*stride, out+3*N/4, stride << 2, N 
>> 2); 


data_t Uk = out[0]; 
data_t Zk = out[O+N/2]; 
data_t Uk2 = out[0O+N/4]; 
data_t Zdk = out[0+3*N/4]; 
out[0] = Uk + (Zk + Zdk); 
out [0O+N/2] = Uk - (Zk + Zdk); 
out [O+N/4] = Uk2 - I* 
(Zk - Zdk); 
out[0+3*N/4] = Uk2 + I* 
(Zk - Zdk); 
ng 
int k; 
for(k=1;k<N/4;k++) { 
data_t Uk = out[k]; 
data_t Zk = out[k+N/2]; 
data_t Uk2 = out[k+N/4]; 
data_t Zdk = out[k+3*N/4]; 
out[k] = Uk + (W(N,k)*Zk + W(N, 3*k)*Zdk); 
out [k+N/2] = Uk - (W(N,k)*Zk + W(N, 3*k)*Zdk); 


out[k+N/4] | = Uk2 - I* 
(W(N,k)*Zk - W(N,3*k)*Zdk); 
out[k+3*N/4] = Uk2 + I* 
(W(N,k)*Zk - W(N,3*k)*Zdk); 
} 


I 
Simple split-radix FFT 


#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float)k / 
(float )N)) 


void conjfft(data_t *base, int TN, 


data_t *in, data_t *out, int stride, int N) { 
if(N == 1) { 
if(in < base) in += TN; 
out[O] = in[0]; 
selse if(N == 2) { 


data_t *10 = in, *i1 = in + stride; 
if(i0 < base) 10 += TN; 
if(i1 < base) 11 += TN; 
out[0] = *10 + *i1; 
out[N/2] *10 - *1i1; 


telse{ 
conjfft(base, TN, in, out, stride << 1, N >> 1); 


conjfft(base, TN, int+tstride, out+N/2, stride << 2, 
N >> 2); 
conjfft(base, TN, in- 
stride, out+3*N/4, stride << 2, N >> 2); 


{ 
data_t Uk = out[0]; 
data_t Zk = out[O+N/2]; 
data_t Uk2 = out[0O+N/4]; 
data_t Zdk = out[0+3*N/4]; 
out [0] = Uk + (Zk + Zdk); 
out [O+N/2] = Uk - (Zk + Zdk); 


out[O+N/4] = Uk2 - I* 


(Zk - Zdk); 


out[@+3*N/4] = Uk2 + I* 


(Zk - Zdk); 
} 
int k; 
For(k=1;k<N/4;k++) { 
data_t Uk = out[k]; 
data_t Zk = out[k+N/2]; 
data_t Uk2 = out[k+N/4]; 
data_t Zdk = out[k+3*N/4]; 
data_t w = W(N,k); 
out[k] = Uk + (w*Zk + conj(w)*Zdk); 
out [k+N/2] = Uk - (w*Zk + conj(w)*Zdk); 
out [k+N/4] = Uk2 - I* 


(w*Zk - conj(w)*Zdk); 


out [k+3*N/4] = Uk2 + I* 


(w*Zk - conj(w)*Zdk); 
i; 
} 
} 
Simple conjugate-pair FFT 


#include <stdio.h> 
#include <math.h> 
#include <stdlib.h> 
#include <complex.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float) 


(k) 7 (float)(N))) 


float 
S(int n, int k) { 
if (n <= 4) return 1.0f; 


int k4 = k % (n/4); 


if (k4 <= n/8) return (s(n/4,k4) * cosf(2.0f * M 
_PI * (float)k4 / (float)n)); 
return (s(n/4,k4) * sinf(2.0f * M_PI * (float)k4 
/ (float)n)); 
} 


void tangentfft8(data_t *base, int TN, data_t “*in, 
data_t *out, int stride, int N) { 
1if(N == 1) { 
if(in < base) in += TN; 
out[O] = in[0]; 
selse if(N == 2) { 


data_t *10 = in, *i1 = in + stride; 
if(i0 < base) 10 += TN; 
if(i1 < base) 11 += TN; 
out [0] = *10 + *i1; 
out[N/2] = *10 - *1i1; 
selse if(N == 
tangentfft8(base, TN, in, out, stride << 1, N 
>> 1); 
tangentfft8(base, TN, int+tstride, out+2, stride 
<< 1, N >> 1); 


data_t tempi = out[0O] + out[2]; 
data_t temp2 = out[0O] - out[2]; 
out[O] = tempi; 


| 
ct 
i?) 
3 
x2) 
NO 


out [2] 

tempi = out[1] - I*out[3]; 
temp2 = out[1] + I*out[3]; 
out[1] = temp1; 

out[3] = temp2; 


telse{ 


tangentfft8(base, TN, in, out, stride << 
>> 2); 
tangentfft8(base, TN, int 
(stride*2), out+2*N/8, stride << 3, N >> 3); 
tangentfft8(base, TN, in- 
(stride*2), out+3*N/8, stride << 3, N >> 3); 
tangentfft8(base, TN, int 
(stride), out+4*N/8, stride << 2, N >> 2); 
tangentfft8(base, TN, in- 
(stride), out+6*N/8, stride << 2, N >> 2); 
int k; 
for(k=0;k<N/8;k++) { 
float s4 = s(N/4,k)/s(N,k); 
float s4_n8 = s(N/4,kK+N/8)/s(N,kK+N/8); 


float s2 = s(N/2,k)/sS(N,k); 


float s2_n8 = s(N/2,kK+N/8)/S(N, K+N/8); 


data_t wO = W(N,k)*s4; 


data_t wi = W(N,k+N/8)*s4_n8; 

data_t w2 = W(N,2*k)*s(N/8,k)/S(N/2,k); 
data_t zk_p = w0 * out[k+4*N/8]; 
data_t zk_n = conj(w0) * out[k+6*N/8]; 
data_t zk2_p = wil * out[k+5*N/8]; 
data_t zk2_n = conj(wi1) * out[k+7*N/8]; 
data_t uk = out[k] * s4 
data_t uk2 = out[k+N/8] * s4 

_n8; 
data_t yk_p = w2 * out[k+2*N/8]; 
data_t yk_n = conj(w2) * out[k+3*N/8]; 


data_t yO = (yk_p + yk_n)*s2; 


data_t yi = (yk_p - yk_n)*I*s2_n2; 


out[k ] = uk + yO + (zk_p + zk_n); 
out[k+4*N/8] = uk + yO - (zk_p + zk_n); 
out[k+2*N/8] = uk - yO - I*(zk_p - zk_n); 
out[k+6*N/8] = uk - yO + I*(zk_p - zk_n); 
out[k+1*N/8] = uk2 - yi + (zk2_p + zk2_n); 
out[k+3*N/8] = uk2 + yi - I*(zk2_p - zk2_n); 
out[k+5*N/8] = uk2 - yi - (zk2_p + zk2_n); 
out[k+7*N/8] = uk2 + y1 + I*(zk2_p - zk2_n); 
} 


} 


void tangentfft4(data_t *base, int TN, 


data_t *in, data_t *out, int stride, int N) { 
if(N == 1) { 
if(in < base) in += TN; 
out[O] = in[0]; 
selse if(N == 2) { 


data_t *10 = in, *1i1 = in + stride; 
if(i0 < base) 10 += TN; 
if(i1 < base) 11 += TN; 
out[0] = *10 + *i1; 
out [N/2] *10 - *i1; 


telse{ 


tangentfft4(base, TN, in, out, stride << 1, N >> 1 
); 


tangentfft8(base, TN, in+stride, out+N/2, stride < 


<2, N >> 2); 
tangentfft8(base, TN, in- 
stride, out+3*N/4, stride << 2, N >> 2); 


{ 
data_t Uk = out[0]; 
data_t Zk = out[O+N/2]; 
data_t Uk2 = out[0O+N/4]; 
data_t Zdk = out[0+3*N/4]; 
out[0] = Uk + (Zk + Zdk); 
out [O+N/2] = Uk - (Zk + Zdk); 
out [O+N/4] = Uk2 - I* 
(Zk - Zdk); 
out[0+3*N/4] = Uk2 + I* 
(Zk - Zdk); 
int k; 
For(k=1;k<N/4;k++) { 
data_t Uk = out[k]; 
data_t Zk = out[kK+N/2]; 
data_t Uk2 = out[k+N/4]; 
data_t Zdk = out[k+3*N/4]; 


data_t w = W(N,k)*S(N/4,k); 


out[k ] = Uk + (w*Zk + conj(w)*Zdk); 


out [k+N/2] Uk - (w*Zk + conj(w)*Zdk); 
out[k+N/4] = Uk2 - I* 
(w*Zk - conj(w)*Zdk); 
out[k+3*N/4] = Uk2 + I* 
(w*Zk - conj(w)*Zdk); 
} 


} 


} 
Simple tangent FFT 


Appendix 2 - FFTs with precomputed LUTs 


This Appendix contains source code listings corresponding to the FFT 
implementations with precomputed coefficients in Implementation details. 


#include <math.h> 
#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float)k / 
(float )N)) 


data_t *LUT; 


void ditfft2(data_t *in, data_t *out, int log2stri 
de, int stride, int N) { 


if(N == 2) { 
out[0] = in[O] + in[stride]; 
out[N/2] = in[O] - in[stride]; 
telse{ 


ditfft2(in, out, log2stridet+1i, stride << 1, N >> 1 
); 


ditfft2(intstride, out+N/2, log2stridet+1, stride < 
<1, N >> 1); 


{ /* k=0 - 
> no multiplication */ 
data_t Ek = out[0]; 
data_t Ok = out[N/2]; 
out[0] = Ek + Ok; 
out[N/2] = Ek - Ok; 


int k; 
For(k=1;k<N/2;k++) { 
data_t Ek 
data_t Ok 


out[k]; 
out[(k+N/2)]; 


data_t w = LUT[k<<log2stride]; 
out [k] = Ek + w * Ok; 


out[(k+N/2) ] = Ek - w * Ok; 
ii 
} 
} 


void fft_init(int N) { 
LUT = malloc(N/2 * sizeof(data_t)); 
int 1; 
for(1=0;1<N/2;1++) LUT[i] = W(N,1); 


Simple radix-2 FFT with precomputed LUT 


#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float)k / 
(float )N)) 

data_t *LUT1; 

data_t *LUT3; 


void splitfft(data_t *in, data_t *out, int log2str 
ide, int stride, int N) { 
1if(N == 1) { 
out[O] = in[0]; 


out[0] 
out[N/2] 


in[O] + in[stride]; 


selse if(N == 2) { 
- in[O] - in[stride]; 


selse{ 


splitfft(in, out, log2stride+i, stride << 1, N >> 
1); 


splitfft(in+stride, out+N/2, log2stride+2, stride 
<< 2, N >> 2)} 


splitfft(in+3*stride, out+3*N/4, log2stridet+2, s 
tride << 2, N >> 2); 


4 
data_t Uk = out[0]; 
data_t Zk = out[O+N/2]; 
data_t Uk2 = out[0O+N/4]; 
data_t Zdk = out[0+3*N/4]; 
out[0] = Uk + (Zk + Zdk); 
out [O+N/2] = Uk - (Zk + Zdk); 
out [O+N/4] = Uk2 - I* 
(Zk - Zdk); 
out[0+3*N/4] = Uk2 + I* 
(Zk - Zdk); 
int k; 
for(k=1;k<N/4;k++) { 
data_t Uk = out[k]; 
data_t Zk = out[kK+N/2]; 
data_t Uk2 = out[k+N/4]; 
data_t Zdk = out[k+3*N/4]; 


data_t wi = LUT1[k<<log2stride] ; 


data_t w3 = LUT3[k<<log2stride] ; 


out[k] = Uk + (wi1*Zk + w3*Zdk),; 
out [k+N/2 | = Uk - (wi*Zk + w3*Zdk); 
out[k+N/4] = Uk2 - I* 


(wi*Zk - w3*Zdk); 
out[k+3*N/4] = Uk2 + I* 
(wi*Zk - w3*Zdk); 


} 
} 
} 


void fft_init(int N) { 
LUT1 = malloc(N/4 * sizeof(data_t)); 
LUT3 = malloc(N/4 * sizeof(data_t)); 
int 1; 
for(1=0;1<N/4;1++) LUT1[1] 
for(1=0;1<N/4;1++) LUT3[1] 


W(N,1); 
W(N, 3*i); 


Simple split-radix FFT with precomputed LUT 


#include <math.h> 
#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 


typedef complex float data_t; 

#define W(N,k) (cexp(-2.0f * M_PI * I * (float)k / 
(float )N)) 

data_t *LUT; 

void conjfft(data_t *base, int TN, 


data_t *in, data_t *out, int log2stride, int strid 
e, int N) { 


if(N == 1) { 
if(in < base) in += TN; 
out[O] = in[0]; 

selse if(N == 2) { 


data_t *10 = in, *i1 = in + stride; 
if(i0 < base) 10 += TN; 
if(ii1 < base) 11 += TN; 
out[0] = 1.0) 4° a 
out[N/2] *10 - *1i1; 


telse{ 


conjfft(base, TN, in, out, log2stridet+1, stride << 
1, N >> 1); 


conjfft(base, TN, int+stride, out+N/2, log2stridet+2 
, stride << 2, N >> 2); 

conjfft(base, TN, in- 
stride, out+3*N/4, log2stride+2, stride << 2, N >> 
2); 


{ 

data_t Uk = out[0]; 

data_t Zk = out[O+N/2]; 

data_t Uk2 = out[0O+N/4]; 

data_t Zdk = out[0+3*N/4]; 
out [0] = Uk + (Zk + Zdk); 
out [O+N/2] = Uk - (Zk + Zdk); 

out [O+N/4] = Uk2 - I* 
(Zk - Zdk); 

out[0+3*N/4] = Uk2 + I* 
(Zk - Zdk); 


int k; 
for(k=1;k<N/4;k++) { 


data_t Uk = out[k]; 
data_t Zk = out[k+N/2]; 
data_t Uk2 = out[k+N/4]; 
data_t Zdk = out[k+3*N/4]; 


data_t w = LUT[k<<log2stride]; 


out[k ] = Uk + (w*Zk + conj(w)*Zdk); 
out [k+N/2] = Uk - (w*Zk + conj(w)*Zdk); 
out [k+N/4] = Uk2 - I* 


(w*Zk - conj(w)*Zdk); 
out[k+3*N/4] = Uk2 + I* 
(w*Zk - conj(w)*Zdk); 
} 


} 
} 


void fft_init(int N) { 
LUT = malloc(N/4 * sizeof(data_t)); 
int 1; 
for (i=0;i<N/4;i+t+) LUT[i] = W(N,i); 

} 

Simple conjugate-pair FFT with precomputed LUT 


#include <math.h> 
#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float) 
(k) 7 (float)(N))) 


float 
Ss(int n, int k) { 


if (n <= 4) return 1.0f; 
int k4 = k % (n/4); 
if (k4 <= n/8) 


return (s(n/4,k4) * cosf(2.0f * M_PI * (float)k4 / 
(float)n)); 
return (s(n/4,k4) * sinf(2.0f * M_PI * (float)k4 
/ (float)n)); 
} 


data_t *LUT, *LUTO, *LUT1, *LUT2; 
float *s2, *s4; 


void tangentfft8(data_t *base, int TN, data_t “in, 
data_t *out, int log2stride, 
int stride, int N) { 
if(N == 1) { 
if(in < base) in += TN; 
out[O] = in[0]; 
selse if(N == 2) { 


data_t *10 = in, *1i1 = in + stride; 
if(i0 < base) 10 += TN; 
if(ii1 < base) 11 += TN; 
out[0] = *10 + *1i1° 
out[N/2] = *10 - *1i1; 
selse if(N == 4) { 
tangentfft8(base, TN, in, out, log2strideti, s 
tride << 1, N >> 1); 
tangentfft8(base, TN, int+stride, out+2, log2st 
ride+i, stride << 1, N >> 1); 


data_t tempi = out[0] + out[2]; 
data_t temp2 = out[0] - out[2]; 
out[O] = tempi; 


out[2] = temp2; 


temp1 = out[1] - I*out[3]; 
temp2 = out[1] + I*out[3]; 
out[1] = temp1; 
out[3] = temp2; 

telse{ 


tangentfft8(base, TN, in, out, log2stridet+2, s 
tride << 2, N >> 2); 

tangentfft8(base, TN, int 
(stride*2), out+2*N/8, log2stride+3, stride << 3, 
N >> 3); 

tangentfft8(base, TN, in- 
(stride*2), out+3*N/8, log2stride+3, stride << 3, 
N >> 3); 

tangentfft8(base, TN, int 
(stride), out+4*N/8, log2stride+2, stride << 2, N 
>> 2); 

tangentfft8(base, TN, in- 
(stride), out+6*N/8, log2stride+2, stride << 2, N 
>> 2); 

int k; 
for (k=0;k<N/8; k++) { 


data_t wO = LUTO[k<<log2stride]; 


data_t wil LUT1[k<<log2stride]; 


data_t w2 = LUT2[k<<log2stride] ; 


data_t zk_p = w0 * out[k+4*N/8]; 
data_t zk_n = conj(w0) * out[k+6*N/8]; 
data_t zk2_p = wl * out[k+5*N/8]; 
data_t zk2_n = conj(wi) * out[k+7*N/8]; 
data_t uk = out[k] * s4 


[ k<<log2stride]; 
data_t uk2 = out[k+N/8] * s4 


[k+N/8 << log2stride]; 
data_t yk_p = w2 


= * out[k+2*N/8]; 
data_t yk_n = conj(w2) * out[k+3*N/8]; 


data_t yO = (yk_p + yk_n)*s2[k<<log2stride]; 


data_t yi = (yk_p - yk_n)*I*s2[k+N/8 << log2stride 


]; 
out[k] = uk + yO 
out[k+4*N/8] = uk + yO 
out[k+2*N/8] = uk - yO 
out[k+6*N/8] = uk - yO 
out[k+1*N/8] = uk2 - y1 
out[k+3*N/8] = uk2 + y1 
out[k+5*N/8] = uk2 - y1 
out[k+7*N/8] = uk2 + y1 

} 
} 
} 


+ 


+ 


(zk_p + zk_n); 
(zk_p + zk_n); 
zk_n); 
zk_n); 


I*(zk_p - 
I*(zk_p - 
(zk2_p 
I*(zk2_p 
(zk2_p 
I*(zk2_p 


+ 


+ 


zk2_n); 
zk2_n); 
zk2_n); 
zk2_n); 


void tangentfft4(data_t *base, int TN, data_t “in, 


data_t *out, int log2stride, 


int stride, int N) { 


if(N == 1) { 


if(in < base) in += TN; 
out[O] = in[0]; 


selse if(N == 2) { 


data_t *10 = in, *1i1 = in + stride; 
if(i0 < base) 10 += TN; 
if(ii1 < base) 11 += TN; 
out[0] = *10 + *i1; 
out[N/2] = *10 - *1i1; 


telse{ 


tangentfft4(base, TN, in, out, log2stridet+1, strid 
e << 1, N >> 1); 


tangentfft8(base, TN, intstride, out+N/2, log2stri 
de+2, stride << 2, N >> 2); 

tangentfft8(base, TN, in- 
stride, out+3*N/4, log2stride+2, stride << 2, N >> 
2); 


1 
data_t Uk = out[0]; 
data_t Zk = out[O+N/2]; 
data_t Uk2 = out[0O+N/4]; 
data_t Zdk = out[0+3*N/4]; 
out[0] = Uk + (Zk + Zdk); 
out [O+N/2] = Uk - (Zk + Zdk); 
out [O+N/4] = Uk2 - I* 
(Zk - Zdk); 
out[0+3*N/4] = Uk2 + I* 
(Zk - Zdk); 
int k; 
for(k=1;k<N/4;k++) { 
data_t Uk = out[k]; 
data_t Zk = out[k+N/2]; 
data_t Uk2 = out[k+N/4]; 
data_t Zdk = out[k+3*N/4]; 


data_t w = LUT[k<<log2stride]; 
out[k] = Uk + (w*Zk + conj(w)*Zdk); 


out [k+N/2] Uk - (w*Zk + conj(w)*Zdk); 


out[k+N/4] | = Uk2 - I* 


(w*Zk - conj(w)*Zdk); 
out[k+3*N/4] = Uk2 + I* 
(w*Zk - conj(w)*Zdk); 
} 


} 
} 


void fft_init(int N) { 
LUTO = malloc(N/8 * sizeof(data_t)); 
LUT1 malloc(N/8 * sizeof(data_t)); 
LUT2 malloc(N/8 * sizeof(data_t)); 
LUT = malloc(N/4 * sizeof(data_t)); 


s2 = malloc(N/4 * sizeof(float)); 
s4 = malloc(N/4 * sizeof(float)); 
int 1; 


for (1=0;1<N/8;i++) LUTO[i] 
)i 


for(1=0;1<N/8;1i++) LUT1I[1] = W(N,1i+N/8)*S(N/4, 1+N/ 
8)/S(N,1+N/8); 


W(N, i)*S(N/4,i)/S(N,i 


W(N, 2*i)*S(N/8,i)/S(N 


for (1=0;1<N/8;1i++) LUT2[i] 
ey ae i 


for(i=0;i<N/4;i++) LUT[i] = W(N,i)*s(N/4,i); 
for(i=0;i<N/4;i++) s4[i] = s(N/4,i)/s(N,i); 
for(i=0;i<N/4;i++) s2[i] = s(N/2,i)/s(N,i); 


t 
Simple tangent FFT with precomputed LUT 


Appendix 3 - FFTs with vectorized loops 


This Appendix contains source code listings corresponding to the 
vectorized FFT implementations in Implementation details. 


#include <math.h> 
#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <xmmintrin.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float) 
(k) 7 (float)(N))) 


data_t **LUT; 


void ditfft2(data_t *in, data_t *out, int log2stri 
de, int stride, int N) { 


if(N == 2) £ 
out[0] = in[O] + in[stride]; 
out[N/2] = in[O] - in[stride]; 


selse if(N == 4){ 


ditfft2(in, out, log2stridet+1i, stride << 1, N >> 1 
); 

ditfft2(intstride, out+N/2, log2stridet+1, stride < 
<1, N >> 1); 


data_t tempo 
data_t temp1 


out[O] + out[2]; 
out [0] out[2]; 
data_t temp2 out[1] T*out[3]; 
data_t temp3 out[1] + I*out[3]; 
if(log2stride) { 


out[O] = creal(tempO) + creal(temp2)*I; 
out[1] = creal(tempi1) + creal(temp3)*I; 
out[2] = cimag(tempO) + cimag(temp2)*I; 


out[3] = cimag(temp1) + cimag(temp3)*I; 


telse{ 
out[O] = tempo; 
out[2] = tempi; 
out[1] = temp2; 
out[3] = temp3; 


ig 
selse if(!log2stride) { 


ditfft2(in, out, log2stridet+1i, stride << 1, N >> 1 
); 


ditfft2(int+stride, out+N/2, log2stridet+1, stride < 
<1, N >> 1); 


int k; 
for (k=0;k<N/2;k+=4) { 


__mi28 Ok_re = _mm_load_ps((float *)&out[k+N/2]); 
__mi28 Ok_im = _mm_load_ps((float *)&out[k+N/2+2] ) 
__mi28 w_re = _mm_load_ps((float *)&LUT[log2stride 
JLk]); 


__mi28 w_im = _mm_load_ps((float *)&LUT[log2stride 
][k+2]); 


__mi28 Ek_re = _mm_load_ps((float *)&out[k]); 


__mi28 Ek_im = _mm_load_ps((float *)&out[k+2]); 


__mi28 wOk_re = _mm_sub_ps(_mm_mul_ps(Ok_re,w_re), 
_mm_mul_ps(Ok_im,w_im)); 


__mi28 wOk_im = _mm_add_ps(_mm_mul_ps(Ok_re,w_im), 
_mm_mul_ps(Ok_im,w_re)); 


__mi28 outO_re = _mm_add_ps(Ek_re, wOk_re); 
__mi28 outO_im = _mm_add_ps(Ek_im, wOk_im); 
__mi28 outi_re = _mm_sub_ps(Ek_re, wOk_re); 


__mi28 outi_im = _mm_sub_ps(Ek_im, wOk_im); 
_mm_store_ps((float *) 
(out+k), _mm_unpacklo_ps(outO_re, outO0_im)); 
_mm_store_ps((float *) 
(out+k+2), _mm_unpackhi_ps(outO_re, outO_im)); 
_mm_store_ps((float *) 
(out+k+N/2), _mm_unpacklo_ps(outi_re, outi_im)); 
_mm_store_ps((float *) 
(out+k+N/2+2), _mm_unpackhi_ps(out1_re, outi_im)); 


selse{ 


ditfft2(in, out, log2stridet+1, stride << 1, N >> 1 
); 


ditfft2(int+stride, out+N/2, log2stridet+1, stride < 
<1, N >> 1); 


int k; 
for (k=0;k<N/2;k+=4) { 


__mi28 Ok_re = _mm_load_ps((float *)&out[k+N/2]); 


__mi28 Ok_im = _mm_load_ps((float *)&out[k+N/2+2] ) 


y 


__mi28 w_re = _mm_load_ps((float *)&LUT[log2stride 
JLk]); 


__mi28 w_im = _mm_load_ps((float *)&LUT[log2stride 
][k+2]); 


__mi28 Ek_re = _mm_load_ps((float *)&out[k]); 
__mi28 Ek_im = _mm_load_ps((float *)&out[k+2]); 


__mi28 wOk_re = _mm_sub_ps(_mm_mul_ps(Ok_re,w_re), 
_mm_mul_ps(Ok_im,w_im)); 


__mi28 wOk_im = _mm_add_ps(_mm_mul_ps(Ok_re,w_im), 
_mm_mul_ps(Ok_im,w_re)); 

_mm_store_ps((float *) 
(out+k), _mm_add_ps(Ek_re, wOk_re)); 

_mm_store_ps((float *) 
(out+k+2), _mm_add_ps(Ek_im, wOk_im)); 

_mm_store_ps((float *) 
(out+k+N/2), _mm_sub_ps(Ek_re, wOk_re)); 

_mm_store_ps((float *) 
(out+k+N/2+2), _mm_sub_ps(Ek_im, wOk_im)); 

} 


} 
} 


void fft_init(int N) { 
int 1; 


#define log2(x) ((aint)(log(x)/log(2))) 


int n_luts = log2(N)-2; 
LUT = malloc(n_luts * sizeof(data_t *)); 


for(i=0;1i<n_luts;i++) { 
int n=N / pow(2,1); 


LUT[i] = _mm_malloc(n/2 * sizeof(data_t), 16); 
Ant: J} 
for (j=0;jJ<n/2;j+=4) { 
data_t w[4]; 
int k; 


for (k=0;k<4;k++) w[k] = W(n,jtk); 


LUT[1i] 

[j] = creal(w[0]) + creal(w[1])*I; 
LUT[1i] 

[jt+1] = creal(w[2]) + creal(w[3])*I; 
LUT[1i] 

[jt+2] = cimag(w[0]) + cimag(w[1])*I; 
LUT[1i] 

[j+3] = cimag(w[2]) + cimag(w[3])*I; 

} 


} 


Radix-2 FFT with vectorized loops 


typedef struct _reg t { 
__mi28 re, im; 


} reg_t; 
static inline reg_t MUL(reg_t a, reg_t b) { 
reg_t r; 
r.re = _mm_sub_ps(_mm_mul_ps(a.re,b.re),_mm_mul_ps 


(a.im,b.im)); 


r.im = 


_mm_add_ps(_mm_mul_ps(a.re,b.im),_mm_mul_ps 
(a.im,b.re)); 


return r; 


static inline reg_t MULJ(reg_t a, reg_t b) { 
reg_t r; 


r.re = 


ram. = 


} 


static 


} 


static 


} 


static 


} 


static 


_mm_add_ps(_mm_mul_ps(a.re,b.re),_mm_mul_ps 
(a.im,b.im)); 


_mm_sub_ps(_mm_mul_ps(a.im,b.re),_mm_mul_ps 
(a.re,b.im)); 


return r; 


inline reg_t ADD(reg_t a, reg_t b) { 
reg_t r; 


r.re 
r.im 


_mm_add_ps(a.re,b.re); 
_mm_add_ps(a.im,b.im); 


return r; 


inline reg_t SUB(reg_t a, reg_t b) { 


reg _t r; 
r.re = _mm_sub_ps(a.re,b.re); 
r.im = _mm_sub_ps(a.im,b.im); 
return r,; 


inline reg_t ADD_I(reg_t a, reg_t b) { 


reg_t r; 
r.re = _mm_sub_ps(a.re,b.im); 
r.im = _mm_add_ps(a.im,b.re); 
return r; 


inline reg_t SUB_I(reg_t a, reg_t b) { 
reg_t r; 


r.re 
r.im 


_mm_add_ps(a.re,b.im); 
_mm_sub_ps(a.im,b.re); 


return r; 


} 

static inline reg_t LOAD(float “*a) { 
reg_t r; 
r.re = _mm_load_ps(a); 
r.im = _mm_load_ps(a+4); 
return r; 


static inline void STORE(float *a, reg_t r) { 
_mm_store_ps(a, r.re); 
_mm_store_ps(a+4, r.im); 


I 
static inline void STOREIL(float *a, reg_t r) { 
_mm_store_ps(a, _mm_unpacklo_ps(r.re, r.im)); 


_mm_store_ps(at+4, _mm_unpackhi_ps(r.re, r.im)); 
} 

Vectorized math functions for split-radix 
implementations 


#include <math.h> 
#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <xmmintrin.h> 


typedef complex float data_t; 

#define W(N,k) (cexp(-2.0f * M_PI * I * (float) 
(k) 7 (float)(N))) 

data_t **LUT1; 

data_t **LUT3; 


#include "vecmath.h" 


void splitfft(data_t *in, data_t *out, 
int log2stride, int stride, int N) { 


if(N == 1) { 
out[O] = in[0]; 
selse if(N == 2) { 
out[O] = in[O] + in[stride]; 
out[1] = in[O] - in[stride]; 
selse if(N == 4) { 


splitfft(in, out, log2stride+i, stride << 1, N >> 
1); 


splitfft(int+stride, out+N/2, log2stride+2, stride 
<< 2, N >> 2); 


splitfft(in+3*stride, out+3*N/4, log2stridet+2, s 
tride << 2, N >> 2); 


data_t tempo 


out[@] + (out[2] + out[3]); 


data_t temp1 


out[O] - (out[2] + out[3]); 
data_t temp2 = out[1] - I* 
(out[2] - out[3]); 
data_t temp3 = out[1] + I* 
(out[2] - out[3]); 
if(log2stride) { 


out[O] = creal(tempO) + creal(temp2)*I; 
out[1] = creal(tempi1) + creal(temp3)*I; 
out[2] = cimag(tempO) + cimag(temp2)*I; 


out[3] = cimag(temp1) + cimag(temp3)*I; 


telse{ 


out[O] = tempo; 
out[2] = tempi; 
out[1] = temp2; 
out[3] = temp3; 


} 
selse if(N == 8) { 


splitfft(in, out, log2stride+i, stride << 1, N >> 
1); 


splitfft(in+stride, out+N/2, log2stride+2, stride 
<< 2, N >> 2)} 


splitfft(in+3*stride, out+3*N/4, log2stride+2, s 
tride << 2, N >> 2); 


data_t o[8]; 
data_t Uk = creal(out[0]) + creal(out[2])*I; 
data_t Zk = out[4]; 


data_t Uk2 = creal(out[1]) + creal(out[3])*I; 
data_t Zdk = out[6]; 


o[O] = Uk + (Zk + Zdk); 
o[4] = Uk - (Zk + Zdk); 
o[2] = Uk2 - I*(Zk - Zdk); 
o[6] = Uk2 + I*(Zk - Zdk); 


} 
{ 


data_t Uk = cimag(out[0]) + cCimag(out[2])*I; 
data_t Zk = out[5]; 


data_t Uk2 = cimag(out[1]) + cimag(out[3])*I; 
data_t Zdk = out[7]; 


data_t wi = LUT1[log2stride][1]; 


data_t w3 = LUT3[log2stride][1]; 


o[1] = Uk + (wi*Zk + w3*Zdk); 


o[5] = Uk - (wi*Zk + w3*Zdk) 
o[3] 


a 


Uk2 - I* 
(wi*Zk - w3*Zdk); 
o[7] = Uk2 + I* 
(wi*Zk - w3*Zdk); 
} 
if(log2stride) { 
out[O] = creal(o[0]) + creal(o[1])*I; 
out[1] = creal(o[2]) + creal(o[3])*I; 
out[2] = cimag(o[0]) + cimag(o[1])*I; 
out[3] = cimag(o[2]) + cimag(o[3])*I; 
out[4] = creal(o[4]) + creal(o[5])*I; 
out[5] = creal(o[6]) + creal(o[7])*I; 
out[6] = cimag(o[4]) + cimag(o[5])*I; 
out[7] = cimag(o[6]) + cimag(o[7])*I; 
telse{ 
int 1; 


for(i=0;i<8;i++) out[i] = offi]; 


I 
selse if(!log2stride) { 


splitfft(in, out, log2stride+i, stride << 1, N >> 
1); 


splitfft(in+stride, out+N/2, log2stride+2, stride 
<< 2, N >> 2); 


splitfft(in+3*stride, out+3*N/4, log2stride+2, s 
tride << 2, N >> 2); 
int k; 
for (k=0;k<N/4;k+=4) { 


reg_t Uk = LOAD((float *)&out[k]); 


reg_t Zk = LOAD((float *)&out[k+N/2]); 


reg_t Uk2 = LOAD((float *)&out[k+N/4]); 

reg_t Zdk = LOAD((float *)&out[k+3*N/4]); 

reg_t wi = LOAD((float *)&LUT1[log2stride]|[k]); 
reg_t w3 = LOAD((float *)&LUT3[log2stride]|[k]); 


reg_t w3Zdk = MUL(w3, Zdk); 
reg_t wiZk = MUL(wi, Zk); 


reg_t sum = ADD(w1iZk, w3Zdk); 


reg_t dif = SUB(w1Zk, w3Zdk); 


STOREIL((float *)&out[k], ADD(Uk, sum)); 


STOREIL((float *)&out[k+N/2], SUB(Uk, sum)); 
STOREIL((float *)&out[k+N/4], SUB_I(Uk2, dif)); 


STOREIL((float *)&out[k+3*N/4], ADD_I(Uk2, dif)); 
} 


telse{ 


splitfft(in, out, log2stride+i, stride << 1, N >> 
1); 


splitfft(int+stride, out+N/2, log2stride+2, stride 
<< 2, N >> 2); 


splitfft(in+3*stride, out+3*N/4, log2stride+2, s 
tride << 2, N >> 2); 


int k; 
For (k=0;k<N/4;k+=4) { 


reg_t Uk = LOAD((float *)&out[k]); 

reg_t Zk = LOAD((float *)&out[k+N/2]); 
reg_t Uk2 = LOAD((float *)&out[k+N/4]); 
reg_t Zdk = LOAD((float *)&out[k+3*N/4]); 


reg_t wi = LOAD((float *)&LUT1[log2stride][k]); 


reg_t w3 = LOAD((float *)&LUT3[log2stride][k]); 


reg_t w3Zdk = MUL(w3, Zdk); 
reg_t wiZk = MUL(wi, Zk); 


reg_t sum = ADD(w1iZk, w3Zdk); 


reg_t dif = SUB(w1izk, w3Zdk); 


STORE((float *)&out[k], ADD(Uk, sum)); 
STORE((float *)&out[k+N/2], SUB(Uk, sum)); 
STORE((float *)&out[kK+N/4], SUB_I(Uk2, dif)); 
STORE((float *)&out[k+3*N/4], ADD_I(Uk2, dif)); 


} 
} 


void fft_init(int N) { 
#define log2(x) ((int)(log(x)/1log(2) )) 


int n_luts = log2(N)-1; 


LUT1 = malloc(n_luts * sizeof(data_t *)); 
LUT3 = malloc(n_luts * sizeof(data_t *)); 
int 1; 


for(i=0;i<n_luts;i++) { 
int n=N / pow(2,1); 


LUT1[i] = _mm_malloc(n/4 * sizeof(data_t),16); 
LUT3[i] = _mm_malloc(n/4 * sizeof(data_t),16); 


if(n == 8) { 
int Jj; 
For (j=0;j<n/4; j++) { 
LUTA[i] 
[J] = W(n,j); 
LUT3[i] 


[J] = w(n,3*)); 


telse{ 


} 


int j; 
for(j=0;j<n/4;jt+=4) { 


data_t wi[4], w3[4]; 
int k; 


for(k=0;k<4;k++) wi[k] = W(n,jt+k); 


for (k=0;k<4;k++) w3[k] = W(n,3*(j+k)); 


LUT1[i] 
[Jj] = creal(wi[0]) + creal(wi[1])*I; 
LUTA1[i] 
[j+1] = creal(wi[2]) + creal(wi[3])*I; 
LUT1[i] 
[j+2] = cimag(wi[0]) + cimag(wi[1])*I; 
LUT1[i] 
[j+3] = cimag(wi[2]) + cimag(w1[3])*I,; 
LUT3[i] 
[Jj] = creal(w3[0]) + creal(w3[1])*I; 
LUT3[i] 
[j+1] = creal(w3[2]) + creal(w3[3])*I; 
LUT3[i] 
[j+2] = cimag(w3[0]) + cimag(w3[1])*I,; 
LUT3[i] 
[j+3] = cimag(w3[2]) + cimag(w3[3])*I; 
i 


} 


Split-radix FFT with vectorized loops 


#include <math.h> 
#include <complex.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <xmmintrin.h> 


typedef complex float data_t; 


#define W(N,k) (cexp(-2.0f * M_PI * I * (float) 
(k) 7 (float)(N))) 
data_t **LUT1; 


#include "vecmath.h" 


data_t *base; 
int TN; 


void conjfft(data_t *in, data_t *out, 
int log2stride, int stride, int N) { 


if(N == 1) { 
if(in < base) in += TN; 
out[O] = in[0]; 

selse if(N == 2) { 


data_t *10 = in, *1i1 = in + stride; 
if(i0 < base) 10 += TN; 
if(i1 < base) 11 += TN; 
out[0] = *10 + *1i1; 
out[N/2] = *10 - *11; 
selse if(N == 4) { 


conjfft(in, out, log2stridet+1, stride << 1, N >> 1 


); 


conjfft(int+stride, out+N/2, log2stridet+2, stride < 


< 2, N >> 2); 

conjfft(in- 
stride, out+3*N/4, log2stride+2, stride << 2, N >> 
2); 


data_t tempO = out[0] + (out[2] + out[3]); 
data_t temp1 = out[0] - (out[2] + out[3]); 
data_t temp2 = out[1] - I* 
(out[2] - out[3]); 
data_t temp3 = out[1] + I* 
(out[2] - out[3]); 
if(log2stride) { 
out[O] = creal(tempO) + creal(temp2)*I; 
out[1] = creal(tempi1) + creal(temp3)*I; 
out[2] = cimag(tempO) + cimag(temp2)*I; 


out[3] = cimag(temp1) + cimag(temp3)*I; 


telse{ 
out[O] = tempo; 
out[2] = tempi; 
out[1] = temp2; 
out[3] = temp3; 
} 


selse if(N == 8) { 


conjfft(in, out, log2stridet+1, stride << 1, N >> 1 


); 


conjfft(int+stride, out+N/2, log2stridet+2, stride < 
< 2, N >> 2); 
conjfft(in- 


stride, out+3*N/4, log2stride+2, stride << 2, N >> 
2); 


data_t o[8]; 
{ 


data_t Uk = creal(out[0]) + creal(out[2])*I; 
data_t Zk = out[4]; 


data_t Uk2 = creal(out[1]) + creal(out[3])*I; 
data_t Zdk = out[6]; 


o[O] = Uk + (Zk + Zdk); 
o[4] = Uk - (Zk + Zdk); 
o[2] = Uk2 - I*(Zk - Zdk); 
o[6] = Uk2 + I*(Zk - Zdk); 


t 

{ 
data_t Uk = cimag(out[0]) + Cimag(out[2])*I; 
data_t Zk = out[5]; 


data_t Uk2 = cimag(out[1]) + cimag(out[3])*I; 
data_t Zdk = out[7]; 


data_t wi = LUT1[log2stride][1]; 


o[1] 


Uk + (wi*Zk + conj(wi1)*Zdk); 


o[5] = Uk - (wt*Zk + conj(wt1)*Zdk); 
o[3] = Uk2 - I* 
(wi*Zk - conj(w1)*Zdk); 
o[7] = Uk2 + I* 
(wi*Zk - conj(w1)*Zdk); 


if(log2stride) { 


out[O] = creal(o[0]) + creal(o[1])*I; 
out[1] = creal(o[2]) + creal(o[3])*I; 
out[2] = cimag(o[O]) + cimag(o[1])*I; 
out[3] = cimag(o[2]) + cimag(o[3])*I; 
out[4] = creal(o[4]) + creal(o[5])*I; 
out[5] = creal(o[6]) + creal(o[7])*I; 
out[6] = cimag(o[4]) + cimag(o[5])*I; 
out[7] = cimag(o[6]) + cimag(o[7])*I; 
selse{ 
int 1; 
for(i=0;1<8;1i++) out[i] = o[1]; 
selse if(!log2stride) { 
conjfft(in, out, log2stridet+1, stride << 1, N >> 1 
); 
conjfft(int+stride, out+N/2, log2stridet+2, stride < 
< 2, N >> 2); 
conjfft(in- 
stride, out+3*N/4, log2stride+2, stride << 2, N >> 
2); 
int k; 
for (k=0;k<N/4;k+=4) { 


reg_t Uk = LOAD((float *)&out[k]); 


reg_t Zk = LOAD((float *)&out[k+N/2]); 


reg_t Uk2 LOAD((float *)&out[k+N/4]); 


reg_t Zdk LOAD( (float *)&out[k+3*N/4]); 


reg_t wi = LOAD((float *)&LUT1[log2stride]|[k]); 


reg_t w3Zdk = MULJ(Zdk, wi); 
reg_t wiZk = MUL(wi, Zk); 


reg_t sum = ADD(wiZk, w3Zdk); 


reg_t dif = SUB(w1izk, w3Zdk); 


STOREIL((float *)&out[k], ADD(Uk, sum)); 
STOREIL((float *)&out[k+N/2], SUB(Uk, sum)); 
STOREIL((float *)&out[k+N/4], SUB_I(Uk2, dif)); 


STOREIL((float *)&out[k+3*N/4], ADD_I(Uk2, dif)); 
} 


selse{ 


conjfft(in, out, log2stride+1, stride << 1, N >> 1 


3 


conjfft(intstride, out+N/2, log2stride+2, stride < 
227 Nee 28 
conjfft(in- 
stride, out+3*N/4, log2stride+2, stride << 2, N >> 
2); 


int k; 
for (k=0;k<N/4;k+=4) { 


reg_t Uk = LOAD((float *)&out[k]); 

reg_t Zk = LOAD((float *)&out[k+N/2]); 
reg_t Uk2 = LOAD((float *)&out[k+N/4]); 
reg_t Zdk = LOAD((float *)&out[k+3*N/4]); 


reg_t wi = LOAD((float *)&LUT1[log2stride]|[k]); 


reg_t w3Zdk = MULJ(Zdk, wi); 
reg_t wiZk = MUL(wi, Zk); 


reg_t sum = ADD(wiZk, w3Zdk); 


reg_t dif = SUB(w1izk, w3Zdk); 


STORE((float *)&out[k], ADD(Uk, sum)); 
STORE((float *)&out[k+N/2], SUB(Uk, sum)); 
STORE((float *)&out[k+N/4], SUB_I(Uk2, dif)); 
STORE( (float ane ADD_I(Uk2, dif)); 


} 
} 


void fft_init(int N) { 
#define log2(x) ((int)(1log(x)/1log(2) )) 
int n_luts = log2(N)-1; 
LUT1 = malloc(n_luts * sizeof(data_t *)); 


int 1; 
for(i=0;1i<n_luts;i++) { 
int n=N / pow(2,1); 


LUT1[i] = _mm_malloc(n/4 * sizeof(data_t),16); 


if(n == 8) { 
int Jj; 
For (j=0;j<n/4; j++) { 
LUT1[i] 
[J] = wn,j); ; 
telse{ 
int Jj; 


for (j=0;j<n/4;j+=4) { 
data_t wi[4]; 
int k; 


for (k=0;k<4;k++) wi[k] = W(n,jt+k); 


LUT1[i] 
[j] = creal(wi[0]) + creal(wi[1])*I; 
LUT1[i] 
[j+1] = creal(wi[2]) + creal(w1[3])*I; 
LUT1[i] 
[j+2] = cimag(w1i[0]) + cimag(wi[1])*I; 
LUT1[i] 
[j+3] = cimag(wi[2]) + cimag(w1[3])*I; 
} 
t 
} 
TN = N; 


i 


Conjugate-pair FFT with vectorized loops 


